Dataset of White House speech quotes by news outlets (.json version)
====================================================

This dataset contains a collection of quotes by news outlets, their location within the source White House speech, and information about the article in which the quotes were cited.

Code for finding and processing matching quotes is here: https://github.com/tisjune/whitehouse-transcripts

URL: http://snap.stanford.edu/quotus/#data
Paper: http://www.cs.cornell.edu/~cristian/Structure_of_Political_Media_Coverage_files/quoting_patterns.pdf

Authors: Vlad Niculae
         Caroline Suen
         Justine Zhang
         Cristian Danescu-Niculescu-Mizil
         Jure Leskovec
Version: 1.0
Contact: justinez@stanford.edu
Last updated: March 3, 2015

The dataset is described further in our paper:

Vlad Niculae, Caroline Suen, Justine Zhang, Cristian Danescu-Niculescu-Mizil, Jure Leskovec
QUOTUS: The Structure of Political Media Coverage as Revealed by Quoting Patterns
In: Proceedings of WWW 2015

====================================================

Stats
====================================================
Size (compressed): 41M
Size (uncompressed): 155M
Number of quote matches: 327663

Format
====================================================
JSON file with the following schema:

{
  ClId: {
    ClSz, TotFq, Src, Speaker,
    QtId: {
      QtFq, Text, Sim, PNums, Align,
      MId: { AId, Tm, Title, URL, IsDup }
    }
  }
}

<ClId>: Cluster id
<ClSz>: Number of different quotes in the cluster
<TotFq>: Total number of matches of all quotes in the cluster
<Src>: Name of White House transcript where cluster originated
<Speaker>: Quote speaker
<QtId>: Quote id
<QtFq>: Number of matches found for quote
<Text>: Text of quote
<Sim>: Similarity score (0 is a perfect match)
<PNums>: Index of paragraphs in White House transcript where quote appears, as comma-separated list of numbers
<Align>: Alignment of quote to transcript. Each quote is split up into the paragraphs in which it appears. We assume that a quote spans multiple paragraphs if it contains '...'.
    Alignments are lists of word indices structured as follows:
        a1,a2,...am;b1,b2,...bn; etc
    Each semicolon-delimited sequence of comma-separated numbers corresponds to one paragraph index P in PNums = p1,p2,...
    Hence in the example above, the first paragraph of the quote contains words p1[a1],p1[a2],...p1[am] of the corresponding transcript.
    A word index equal to -1 means that the particular word was not matched to any word in the transcript.
    Consecutive word indices with a2-a1 > 1 mean that the quote skipped at least one word in the transcript.
    For quotes which span multiple paragraphs, we cluster the quote according to the longest segment.
<MId>: Match id
<AId>: Article ID of article in which quote appeared (as (integer id, year))
<Tm>: Timestamp of article (%d%m%Y)
<Title>: Title of article
<URL>: URL of article (can be inactive)
<IsDup>: Identifies whether or not article is original content:
    'Nonwire': Original
    'AP': copied from the Associated Press wire
    'Reuters': copied from Reuters

* Note that the data is also available in tab-separated format here: http://snap.stanford.edu/quotus/#data

Remarks
====================================================
While our paper only used quotes which came from speeches where Obama was the speaker, the dataset contains quotes from all speakers.

White House transcripts can be downloaded separately: http://snap.stanford.edu/quotus/#data

Contact us regarding access to the complete texts of the news articles.
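
Example: decoding PNums, Align and Tm
====================================================
The PNums/Align encoding described under Format can be easier to follow with a small worked example. The Python sketch below is illustrative only and is not part of the released code. It assumes that PNums, Align and Tm are stored as strings in exactly the formats described above, and that Tm is zero-padded. The function names (parse_alignment, skipped_gaps, parse_timestamp) and the sample values are hypothetical.

```python
from datetime import datetime


def parse_alignment(pnums: str, align: str) -> list[tuple[int, list[int]]]:
    """Pair each paragraph index in PNums with its word indices from Align.

    Assumes PNums looks like "p1,p2,..." and Align looks like
    "a1,a2,...;b1,b2,...;" (a trailing semicolon is tolerated).
    """
    paragraphs = [int(p) for p in pnums.split(",") if p.strip()]
    segments = [seg for seg in align.split(";") if seg.strip()]
    if len(paragraphs) != len(segments):
        raise ValueError("PNums and Align describe a different number of paragraphs")
    return [
        (p, [int(w) for w in seg.split(",") if w.strip()])
        for p, seg in zip(paragraphs, segments)
    ]


def skipped_gaps(word_indices: list[int]) -> list[tuple[int, int]]:
    """Return consecutive index pairs (a1, a2) with a2 - a1 > 1.

    Pairs that involve -1 (an unmatched word) are ignored; this is a
    choice made for this sketch, not something the README specifies.
    """
    return [
        (a, b)
        for a, b in zip(word_indices, word_indices[1:])
        if a != -1 and b != -1 and b - a > 1
    ]


def parse_timestamp(tm: str):
    """Parse a Tm value using the documented %d%m%Y format (assumed zero-padded)."""
    return datetime.strptime(tm, "%d%m%Y").date()


# Made-up values, not taken from the dataset.
aligned = parse_alignment("3,4", "0,1,3,-1,4;10,11;")
for paragraph, words in aligned:
    unmatched = words.count(-1)
    print(paragraph, words, "unmatched:", unmatched, "gaps:", skipped_gaps(words))
print(parse_timestamp("03032015"))
```

After loading the file with json.load, the same helpers could be applied to the PNums and Align values of each quote entry, provided the stored values match the string formats assumed here.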