Dataset of postprocessed White House speech transcripts
====================================================

This dataset contains preprocessed transcripts of White House speeches and press briefings. The collection covers transcripts up to October 2014.

Code for fetching transcripts is here: https://github.com/tisjune/whitehouse-transcripts

URL: http://snap.stanford.edu/quotus/#data
Paper: http://www.cs.cornell.edu/~cristian/Structure_of_Political_Media_Coverage_files/quoting_patterns.pdf

Authors:
	Vlad Niculae
	Caroline Suen
	Justine Zhang
	Cristian Danescu-Niculescu-Mizil
	Jure Leskovec

Version: 1.0

Contact: justinez@stanford.edu

Last updated: March 3, 2015

The dataset is described further in our paper:
	Vlad Niculae, Caroline Suen, Justine Zhang, Cristian Danescu-Niculescu-Mizil, Jure Leskovec
	QUOTUS: The Structure of Political Media Coverage as Revealed by Quoting Patterns
	In: Proceedings of WWW 2015

====================================================

Stats
====================================================

Size (compressed): 33M
Size (uncompressed): 135M
Number of transcripts: 4525

Format
====================================================

The data is in JSON format with the following schema:

{	
	'id': 
		{
			'title': title of speech,
			'date': date of speech (%d%m%Y),
			'paragraphs':
				[
					{
						'words': [array of words],
						'speaker': name of speaker
					},
					...
				]
		}
}
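
As a rough illustration of the schema above, the following Python sketch loads the file and walks over every paragraph. The function names and the file path passed to load_transcripts are hypothetical (this README does not name the data file). The sketch also assumes that each entry in 'paragraphs' is an object with 'words' and 'speaker' keys, that the file is UTF-8 encoded, and that 'date' is stored as a string in the %d%m%Y format listed above.

```python
import json
from datetime import datetime


def load_transcripts(path):
    """Read the JSON file at `path` (hypothetical name) into a dict keyed by transcript id."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def parse_date(date_string):
    """Parse a date string in the %d%m%Y format given in the schema."""
    return datetime.strptime(date_string, "%d%m%Y").date()


def iter_paragraphs(transcripts):
    """Yield (transcript_id, title, date, speaker, text) for every paragraph."""
    for transcript_id, transcript in transcripts.items():
        date = parse_date(transcript["date"])
        for paragraph in transcript["paragraphs"]:
            # Assumption: 'words' is a list of tokens; join them back into text.
            text = " ".join(paragraph["words"])
            yield transcript_id, transcript["title"], date, paragraph["speaker"], text
```

A typical use would be to call load_transcripts on the uncompressed data file and pass the result to iter_paragraphs.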


Remarks
====================================================

Note that some speakers appear under multiple names, e.g., THE PRESIDENT and PRESIDENT OBAMA.

Our paper used a subset of the transcript data consisting of speeches in which Obama was a speaker. A single transcript can contain multiple speakers; in such cases, we took only the excerpts spoken by Obama.
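
One possible way to extract such a subset is sketched below in Python. It is not the exact procedure used in the paper. The alias set contains only the two speaker labels mentioned in these remarks, so the data may contain further variants that need to be added. The names OBAMA_ALIASES, normalize_speaker and obama_excerpts are hypothetical, and the input is assumed to be the parsed JSON dict described in the Format section.

```python
# Speaker labels mentioned in this README; the data may contain other variants.
OBAMA_ALIASES = {"THE PRESIDENT", "PRESIDENT OBAMA"}


def normalize_speaker(name):
    """Uppercase, collapse whitespace, and drop a trailing colon, if any."""
    return " ".join(name.rstrip(": ").upper().split())


def obama_excerpts(transcripts, aliases=OBAMA_ALIASES):
    """Map transcript id -> list of paragraph texts whose speaker matches an alias.

    Transcripts without any matching paragraph are left out.
    """
    excerpts = {}
    for transcript_id, transcript in transcripts.items():
        spoken = [
            " ".join(paragraph["words"])
            for paragraph in transcript["paragraphs"]
            if normalize_speaker(paragraph["speaker"]) in aliases
        ]
        if spoken:
            excerpts[transcript_id] = spoken
    return excerpts
```

Matching on a fixed alias set is a deliberate simplification: it is easy to inspect and extend, but it will miss any label not listed in the set.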

Speech transcripts were taken from
www.whitehouse.gov/briefing-room/speeches-and-remarks
and
http://www.whitehouse.gov/briefing-room/press-briefings