Visualizations and plots
- Top phrases between August 2008 and February 2009.
- Phrases containing "economy".
- Top phrases in the last month.
- Variants of the "palling around with terrorists" phrase.
Phrase graph and clusters:
Parts of the phrase graph:
- "Palling around with terrorists": Part of the phrase graph containing variants of the phrase "Palling around with terrorists". Each box represents a variant of a phrase that appeared on the web, and edges link phrases that share many words.
- "It's going to be a president's job to deal with more than one thing at once": Part of the phrase graph containing variants of this phrase. Each box represents a variant of a phrase that appeared on the web, and edges link phrases that share many words.
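To make the "edges link phrases that share many words" idea concrete, here is a minimal sketch of how such a graph could be assembled. The similarity measure is an assumption: the captions above do not say how "many shared words" is measured, so this example uses Jaccard overlap of word sets with a hypothetical threshold `min_overlap`. The helper names `tokenize` and `build_phrase_graph` are illustrative only and are not taken from the original system.

```python
from itertools import combinations


def tokenize(phrase):
    """Lowercase a phrase and split it into a set of words, dropping punctuation."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in phrase.lower())
    return set(cleaned.split())


def build_phrase_graph(phrases, min_overlap=0.5):
    """Return an adjacency dict linking phrases whose word sets overlap enough.

    min_overlap is a hypothetical Jaccard threshold, not a value from the source.
    """
    words = {p: tokenize(p) for p in phrases}
    graph = {p: set() for p in phrases}
    for a, b in combinations(phrases, 2):
        union = words[a] | words[b]
        if not union:
            continue  # both phrases were empty after cleaning
        jaccard = len(words[a] & words[b]) / len(union)
        if jaccard >= min_overlap:
            graph[a].add(b)
            graph[b].add(a)
    return graph


variants = [
    "palling around with terrorists",
    "pals around with terrorists",
    "palling around with domestic terrorists",
    "the fundamentals of our economy are strong",
]
graph = build_phrase_graph(variants)
for phrase, neighbours in graph.items():
    print(phrase, "->", sorted(neighbours))
```

In this toy input the three "terrorists" variants end up connected to each other, while the unrelated economy phrase stays isolated. The pairwise comparison is quadratic in the number of phrases, so a sketch like this only suits small samples.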
Phrase clusters identified by our DAG partitioning algorithm:
- "Lipstick on a pig": a phrase cluster with 61 different variants of the "Lipstick on a pig" phrase.
- "The fundamentals of our economy are strong" phrase cluster.
Number of documents, words and phrases over time:
- Number of articles produced is about constant over time.
- Number of words in articles is about constant over time.
- Number of extracted phrases in articles is about constant over time.
Comparison to baseline techniques:
Here we compare our meme-tracking techniques to simple baseline approaches for topic tracking and information cascade identification.
- Top 50 most frequent words and named entities (after heavy preprocessing, stopword removal, and thresholding on document frequency).
- Top 5 most frequent words and named entities (after heavy preprocessing, stopword removal, and thresholding on document frequency).
Now a few patterns can be observed: Palin and McCain diminish over time, while Obama peaks on election day, November 4.
- Top 50 most in-linked documents. Most documents receive links for short periods of time.
- 50 topics from LDA (Latent Dirichlet Allocation). Notice that although the topics broadly resemble elections, social media and blogging, the war in Iraq, and so on, the changes in vocabulary are not significant, so most clusters are quite stable over time. Because of scalability issues, we took the top 10,000 most in-linked documents and applied aggressive stopword removal (minimum word frequency of 40). To find 100 topics, LDA took 2 days to run.
- Top 50 unclustered phrases: raw quoted phrases, with no clustering. Notice how no phrase gains significant volume, because its appearances are scattered across many different variations.
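The word-frequency baseline in the list above (stopword removal followed by thresholding on document frequency) can be sketched roughly as follows. This is an illustration under stated assumptions, not the preprocessing actually used: the stopword list is a tiny placeholder, the tokenizer is a plain whitespace split, and the names `STOPWORDS`, `top_words`, and `min_doc_freq` are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list; a real run would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "on", "in", "and", "to", "is", "with"}


def top_words(documents, k=5, min_doc_freq=2):
    """Return the k most frequent words that occur in at least min_doc_freq documents."""
    term_counts = Counter()  # total occurrences of each word
    doc_counts = Counter()   # number of documents containing each word
    for text in documents:
        tokens = [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    # Keep only words that pass the document-frequency threshold.
    kept = {w: c for w, c in term_counts.items() if doc_counts[w] >= min_doc_freq}
    return Counter(kept).most_common(k)


docs = [
    "obama speaks on the economy",
    "mccain and palin campaign in ohio",
    "obama and mccain debate the economy",
    "palin interview on the economy",
]
print(top_words(docs, k=3))
```

Running the same count separately for each day or week would give the per-word time series that the frequency plots describe.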