Dataset of postprocessed White House speech transcripts
====================================================

This dataset contains a collection of preprocessed White House speeches and press briefings. The transcript collection contains speeches up to October 2014.

Code for fetching transcripts is here: https://github.com/tisjune/whitehouse-transcripts

URL: http://snap.stanford.edu/quotus/#data
Paper: http://www.cs.cornell.edu/~cristian/Structure_of_Political_Media_Coverage_files/quoting_patterns.pdf

Authors:
Vlad Niculae
Caroline Suen
Justine Zhang
Cristian Danescu-Niculescu-Mizil
Jure Leskovec

Version: 1.0
Contact: justinez@stanford.edu
Last updated: March 3, 2015

The dataset is described further in our paper:

Vlad Niculae, Caroline Suen, Justine Zhang, Cristian Danescu-Niculescu-Mizil, Jure Leskovec
QUOTUS: The Structure of Political Media Coverage as Revealed by Quoting Patterns
In: Proceedings of WWW 2015

====================================================
Stats
====================================================

Size (compressed): 33M
Size (uncompressed): 135M
Number of transcripts: 4525

Format
====================================================

The data is in JSON format with the following schema. Each transcript is keyed by its id, and 'paragraphs' is a list with one entry per paragraph:

    {
        'id': {
            'title': title of speech,
            'date': date of speech (%d%m%Y),
            'paragraphs': [
                {
                    'words': [array of words],
                    'speaker': name of speaker
                }
            ]
        }
    }

Remarks
====================================================

Note that some speakers appear under multiple names, e.g., THE PRESIDENT; PRESIDENT OBAMA.

Our paper used a subset of the transcript data consisting of speeches where Obama was the speaker. Note that one speech can contain multiple speakers (in this case, we took only the excerpts spoken by Obama).

Speech transcripts were taken from www.whitehouse.gov/briefing-room/speeches-and-remarks and http://www.whitehouse.gov/briefing-room/press-briefings.
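
Example
====================================================

The following is a minimal Python sketch, not the code used for the paper, of how the schema and remarks above could be put to use: it loads the JSON file, parses the date field, and keeps only the paragraphs attributed to Obama. Several things here are assumptions rather than facts about the dataset: that each entry in 'paragraphs' is an object with 'words' and 'speaker' keys, that the date is stored as an eight-digit string matching %d%m%Y, that the two speaker labels mentioned in the remarks are the only ones needed (there may be others), and the file name in the usage comment, which is hypothetical. The function names are illustrative.

```python
import json
from datetime import datetime

# Assumption: only the two labels named in the remarks are listed here;
# the data may contain further variants that should be added.
OBAMA_NAMES = {"THE PRESIDENT", "PRESIDENT OBAMA"}


def load_transcripts(path):
    """Load the whole dataset as a dict mapping transcript id -> transcript."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def parse_date(raw):
    """Parse a date string in the documented %d%m%Y format (e.g. '03032015')."""
    return datetime.strptime(raw, "%d%m%Y").date()


def obama_paragraphs(transcript, names=OBAMA_NAMES):
    """Return the text of each paragraph whose speaker label is in `names`.

    Labels are compared case-insensitively after stripping whitespace.
    """
    selected = []
    for paragraph in transcript["paragraphs"]:
        speaker = (paragraph.get("speaker") or "").strip().upper()
        if speaker in names:
            selected.append(" ".join(paragraph["words"]))
    return selected


# Usage (hypothetical file name):
#   transcripts = load_transcripts("whitehouse_transcripts.json")
#   for transcript_id, transcript in transcripts.items():
#       excerpts = obama_paragraphs(transcript)
#       if excerpts:
#           print(transcript_id, parse_date(transcript["date"]), len(excerpts))
```

Matching on a fixed set of labels is a deliberate simplification. Listing the distinct 'speaker' values in the data first is a reasonable way to check which labels actually occur before relying on this filter.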