Visualizations and plots

Top phrases between August 2008 and February 2009.
Interactive visualization.

Phrases containing "economy".
Interactive visualization.

Top phrases in the last month.
Interactive visualization.

Number of mentions of "A changing environment will affect alaska more than any other state because of our location i'm not one though who would attribute it to being man-made." over time.
Notice that the phrase was first said a long time before it was rediscovered by mass media and bloggers.

Number of mentions of "Bristol and the young man she will marry are going to realize very quickly the difficulties of raising a child which is why they will have the love and support of our entire family." over time.
Notice a quick spike in the number of mentions followed by a fast decay.

Variants of the "palling around with terrorists" phrase.
Each box represents a variant of a phrase that appeared on the web and edges link phrases that share many words.

Variants of the "It's going to be a president's job to deal with more than one thing at once." phrase.
Each box represents a variant of a phrase that appeared on the web and edges link phrases that share many words.
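
As a rough illustration of how edges between phrase variants could be drawn, the sketch below links two phrases when they share many words. The overlap measure, the 0.6 threshold, and the function names word_overlap and variant_edges are assumptions made for this example, not necessarily the criterion used to produce the graphs shown here. The input strings are toy examples.

```python
from itertools import combinations


def word_overlap(a: str, b: str) -> float:
    """Fraction of the shorter phrase's distinct words that also occur in the other phrase."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))


def variant_edges(phrases, threshold=0.6):
    """Return pairs of phrases whose word overlap reaches the (assumed) threshold."""
    return [
        (a, b)
        for a, b in combinations(phrases, 2)
        if word_overlap(a, b) >= threshold
    ]


# Toy input: two variants of one phrase and one unrelated phrase.
phrases = [
    "palling around with terrorists",
    "he was palling around with terrorists",
    "lipstick on a pig",
]
edges = variant_edges(phrases)
```

Normalizing by the shorter phrase lets a short subphrase link to the longer phrase that contains it, which matches the way shortened variants appear next to long ones in the graphs.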

Various variants of the "you can put lipstick on a pig" quote. The width of the line is proportional to the number of mentions.
Notice how certain subphrases of the long phrase become more/less mentioned than the original long phrase.

Phrase graph and clusters:


Parts of the phrase graph:

Phrase clusters identified by our DAG partitioning algorithm:

Number of documents, words and phrases over time:


Notice the daily and weekly periodicities. Apart from these, the total number of new articles, and of words and phrases in those articles, is about constant over time.

Comparison to baseline techniques:


Here we compare our meme-tracking techniques to simple baseline approaches for topic tracking and information cascade identification.

  • Top 50 most frequent words and named entities (after heavy preprocessing, stopword removal, and thresholding on document frequency).
  • Top 5 most frequent words and named entities (after heavy preprocessing, stopword removal, and thresholding on document frequency).
    Now a few patterns can be observed: Palin and McCain diminish over time, and Obama peaks on election day, November 4.
  • Top 50 most in-linked documents. Most documents receive links for short periods of time.
  • 50 topics from LDA (Latent Dirichlet Allocation). Notice that the topics broadly resemble elections, social media and blogging, the war in Iraq, and so on; the changes in vocabulary are not significant, and thus most clusters are quite stable over time. Since we had scalability issues, we took the top 10,000 most in-linked documents and applied aggressive stopword removal (minimum word frequency of 40). To find 100 topics, LDA took 2 days to run.
  • Top 50 unclustered phrases: raw quoted phrases with no clustering. Notice that no phrase gains significant volume, because its appearances are scattered across many different variations.
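
The word-frequency baseline in the list above can be sketched roughly as follows. This is a minimal illustration only: the stopword list, the min_doc_freq threshold, whitespace tokenization, and the function name top_words are assumptions for this example, and the preprocessing used for the plots was considerably heavier.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list is much longer).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "on"}


def top_words(documents, k=5, min_doc_freq=2):
    """Return the k most frequent non-stopword terms that occur in at least
    min_doc_freq documents, as (word, total_count) pairs."""
    term_counts = Counter()  # total occurrences of each word
    doc_counts = Counter()   # number of documents containing each word
    for doc in documents:
        tokens = [t for t in doc.lower().split() if t not in STOPWORDS]
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    kept = {w: c for w, c in term_counts.items() if doc_counts[w] >= min_doc_freq}
    return Counter(kept).most_common(k)


# Toy documents, not taken from the dataset.
docs = [
    "the economy is weak",
    "economy and the election",
    "election day in november",
    "economy economy",
]
result = top_words(docs)
```

Running the same count separately for each day or week would give one time series per word, which is the form in which the baseline plots are presented.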