Real-time information on news sites, blogs, and social networking sites changes dynamically and spreads rapidly through the Web. Developing methods for handling such information at massive scale requires that we think about how its content varies over time, how it is transmitted, and how it mutates as it spreads.
NIFTY is a system that finds mutations of a single piece of information across the daily news cycle. Building on Memetracker, the system processes 3.5 million news articles and 2 million quotes mentioned in them each day, and finds the top clusters of quotes through a process called incremental clustering. Incremental meme-clustering is a novel, highly scalable approach that efficiently extracts and identifies mutational variants of a single meme. NIFTY runs orders of magnitude faster than our previous Memetracker system, while also delivering better consistency and quality of extracted memes.
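To illustrate the general idea of incremental clustering mentioned above, here is a minimal sketch. It is not NIFTY's actual algorithm. It assumes word-level Jaccard similarity between phrases, a fixed similarity threshold of 0.5, and the first phrase of each cluster as its representative; the names `IncrementalClusterer`, `add_batch`, and `top_clusters` are hypothetical.

```python
from dataclasses import dataclass, field


def tokens(phrase: str) -> frozenset:
    """Lowercased word set, used here as a crude phrase signature (assumption)."""
    return frozenset(phrase.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two word sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class Cluster:
    representative: frozenset          # signature of the first phrase seen
    phrases: list = field(default_factory=list)


class IncrementalClusterer:
    """Hypothetical sketch: clusters persist across batches and are never rebuilt."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold     # assumed cutoff, not taken from the source
        self.clusters = []

    def add_batch(self, phrases):
        """Assign each new phrase to its most similar cluster, or start a new one."""
        for phrase in phrases:
            signature = tokens(phrase)
            best, best_sim = None, 0.0
            for cluster in self.clusters:
                sim = jaccard(signature, cluster.representative)
                if sim > best_sim:
                    best, best_sim = cluster, sim
            if best is not None and best_sim >= self.threshold:
                best.phrases.append(phrase)
            else:
                self.clusters.append(Cluster(signature, [phrase]))

    def top_clusters(self, k: int):
        """Return the k largest clusters by number of phrases."""
        return sorted(self.clusters, key=lambda c: len(c.phrases), reverse=True)[:k]
```

Each call to `add_batch` touches only the new phrases, so a new day's quotes can be folded into the existing clusters without reclustering earlier days. The linear scan over clusters is kept for clarity; a system at the scale described would need an index to find candidate clusters quickly.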
We demonstrate the effectiveness of our approach by processing a 20-terabyte dataset of 6.1 billion blog posts and news articles that we have been continuously collecting for the last four years. NIFTY extracted 2.9 billion unique textual phrases and identified more than 9 million memes. Our meme-tracking algorithm processed the entire dataset in less than five days on a single machine.
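The figures above imply a sustained processing rate, worked out in the short calculation below. Because the run took less than five days, the results are lower bounds. The calculation assumes decimal terabytes (10^12 bytes); the function name `min_rate_per_second` is introduced here for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def min_rate_per_second(total: float, days: float) -> float:
    """Lower bound on the per-second rate if `total` units finish within `days`."""
    return total / (days * SECONDS_PER_DAY)


documents = 6.1e9        # blog posts and news articles, from the text
dataset_bytes = 20e12    # 20 TB, assuming decimal terabytes
days = 5                 # upper bound on the processing time, from the text

docs_per_second = min_rate_per_second(documents, days)
megabytes_per_second = min_rate_per_second(dataset_bytes, days) / 1e6

print(f"at least {docs_per_second:,.0f} documents per second")
print(f"at least {megabytes_per_second:.1f} MB per second")
```

Under these assumptions, the single machine sustained roughly 14,000 documents and 46 MB of input per second, or more.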