CS341
Project in Mining Massive Data Sets
Spring 2013
Datasets
When dealing with these datasets, please be careful and responsible. The datasets are meant to be used strictly for the purposes of the class project and nothing else. This means: (1) Do not do anything ''funny'' with the dataset; (2) Do not try to break the anonymization; (3) Do not share the data outside the class; (4) Do not copy the data off Amazon EC2; (5) After the class is over, destroy all data.
Datasets ''in progress''
These datasets are currently under preparation, but you can find rough descriptions
at
http://bit.ly/CS341DATA.
Stanford CS341 only datasets
- Walmart
-
[UPDATE] We recently uploaded a new version of the item dataset, containing three files: item_main_attributes.tar, item_scs_attributes.tar, and categories.json. The fields are self-explanatory and correspond to the fields in the previous version of the dataset. More specifically:
-
item_main_attributes.tar contains the main attributes of items, and item_scs_attributes.tar contains additional item attributes. Each item in these files is stored on one line, in JSON format. There is also a timestamp associated with each item, specifying the day on which it was available.
-
The category of each item now has a new format. For example, “0a3920a582107a582797” means the category path of this item is 0 -> 3920 -> 582107 -> 582797, where these numbers are department IDs, as specified in the file categories.json.
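Assuming the lowercase 'a' acts purely as a separator between numeric department IDs (an inference from the example above, not documented behavior), a category string can be decoded like this:

```python
def parse_category_path(encoded):
    """Split an encoded category string such as "0a3920a582107a582797"
    into its department-ID path [0, 3920, 582107, 582797]."""
    return [int(part) for part in encoded.split("a")]

print(parse_category_path("0a3920a582107a582797"))  # [0, 3920, 582107, 582797]
```

The resulting IDs can then be looked up in categories.json.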
-
All the questions about this dataset should be addressed to Ba-Quy: bvuong [at] walmartlabs [dot] com.
-
Query log data. A collection of ~250M queries on Walmart.com over a certain time period. Each line in the data file is a query in JSON format, consisting of the following fields:
- visitorid: a unique ID assigned to the visitor via a cookie.
- wmsessionid: An ID that uniquely identifies a user session.
- rawquery: posted query
- shownitems: IDs of items which were shown
- clickeditems: IDs of items which were clicked
- searchattributes: a composite field which specifies the filters applied to the query. It has the following sub-fields:
- facet: filters by facet. E.g., facet=price:$10-$20
- facets: similar to facet
- cat_id: the categories to which the search result was narrowed down. For example, cat_id=4171_14521 means the search result was first narrowed down to category 4171 and then to 14521 (14521 is a sub-category of 4171)
- search_constraint: the ID of the category under which the query was issued. ID=0 means the top category, i.e., the entire site
- clicks: an array of items which were clicked. Each item consists of:
- itemid: item ID
- position: item position in the search result
- ordered: whether the item was ordered
- incart: whether the item was added to cart
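A minimal sketch of reading one query-log line with the fields described above; the sample values below are invented for illustration and are not from the actual data:

```python
import json

# A made-up query-log line following the field layout described above.
line = json.dumps({
    "visitorid": "v123",
    "wmsessionid": "s456",
    "rawquery": "patio lighting",
    "shownitems": ["i1", "i2", "i3"],
    "clickeditems": ["i2"],
    "searchattributes": {"cat_id": "4171_14521", "search_constraint": "0"},
    "clicks": [{"itemid": "i2", "position": 2, "ordered": False, "incart": True}],
})

query = json.loads(line)
# Positions of clicked items are useful for click-model / relevance work.
clicked_positions = [c["position"] for c in query["clicks"]]
print(query["rawquery"], clicked_positions)
```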
-
[SEE UPDATE ABOVE] Item data. A collection of Walmart items. Each line in the data file is one item in JSON format, consisting of the following attributes:
- item_id: item ID
- title: item title
- item_type: item type
- type: item type
- raw_brand: item brand
- is_available: whether this item is available
- is_media: whether this item is media
- is_digital_music: whether this item is digital music
- gender: item gender
- curr_item_price: item price
- static_color: item color
- static_size: item size
- primary_category_path: item hierarchical categories. E.g., “0a4096a530719a531643” means the item is under hierarchical categories 0, 4096, 530719, 531643 (0 is the top category)
- long_description: item description
-
Problem themes.
- Session modeling.
- Application: personalizing search; improving relevance by query expansion or pseudo-relevance feedback.
-
A few examples:
- Use the previous search to improve the relevance of the current search. E.g., the user previously searched for halogen lights and now searches for patio lighting; can we use this insight to improve the relevance of the search results?
- E-Commerce is a closed ecosystem (the clicks all stay within the site) so a user can hop between searching, browsing, viewing item page, checkout, and homepage. Can we model what the user has been doing on other pages to provide better results on the current page?
- Attribute modeling.
- Application: how to model the sales of new items based on their attribute set, and how to estimate an attribute value (e.g., price) from other attributes.
-
A few examples:
- Given a set of product attributes (size, color, brand, etc.) and values (large, blue, Levi's, etc.), can we predict the sales of an item?
-
Gild Dataset 1
-
Content. This data set consists of x million profiles of software developers. The profiles contain information about the developers' education, work history, and GitHub profiles. Each profile may contain the following information about a person:
- education data (schools, degrees, majors, dates)
- employment data (company, job title, dates)
- location
- number of repositories on GitHub
- number of followers on GitHub
- list of programming languages they've used on GitHub or answered questions about on Stack Overflow
- list of skills the person has listed on social media such as LinkedIn
- a rough score representing their contribution to open source and Stack Overflow that attempts to rank their talent
- a list of social media profiles known to be held by the person
-
Missing Data. Some data is missing, and it is not missing completely at random. Just because a profile doesn't have education data doesn't mean that person never went to school; it only means the data is not available, because they have not listed it or have made it private.
Example datapoint:
- Name: ****** (anonymized)
-
Education: [{School: Stanford, Degree: Masters, Major: Computer Science}, {School: Stanford, Degree: Bachelor's, Major: Computer Science}]
-
Work: [{Company: Google, Job Title: Software Developer, From: '2012-01-01', To: 'Current'}]
- Repositories: 20
- Followers: 57
- Languages from open source and Stack Overflow: [Java, HTML, Ruby]
- Skills self listed: [JavaScript, Java, jQuery, postgres]
- Estimated Skill Score (1-5): 2
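A sketch of flattening such a profile into features, using hypothetical field names based on the example above (the real schema may differ). Absent fields are kept as None rather than zero, since the data is not missing at random:

```python
# Hypothetical profile record mirroring the example datapoint above;
# the field names are assumptions, not the actual schema.
profile = {
    "education": [
        {"school": "Stanford", "degree": "Masters", "major": "Computer Science"},
        {"school": "Stanford", "degree": "Bachelor's", "major": "Computer Science"},
    ],
    "work": [{"company": "Google", "job_title": "Software Developer",
              "from": "2012-01-01", "to": "Current"}],
    "repositories": 20,
    "followers": 57,
    "languages": ["Java", "HTML", "Ruby"],
    "skills": ["JavaScript", "Java", "jQuery", "postgres"],
    "score": 2,
}

def to_features(p):
    """Flatten a profile into a feature dict, keeping absent fields as
    None (missing) rather than zero, since data is not missing at random."""
    return {
        "n_degrees": len(p["education"]) if "education" in p else None,
        "n_jobs": len(p["work"]) if "work" in p else None,
        "repositories": p.get("repositories"),
        "followers": p.get("followers"),
        "n_languages": len(p.get("languages", [])),
        "n_skills": len(p.get("skills", [])),
    }

print(to_features(profile))
```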
-
Collection. This data was collected by aggregating social media profiles based on an email address. Email addresses were originally collected from GitHub repositories.
-
Projects.
-
"Spot the rock star before they're famous."
Our database of developers contains employees from some of the top companies in the tech world, big and small. Rank the profiles such that the ones most likely to be hired by top companies are highest. Identify the factors which independently drive those rankings.
-
"If you like Jeff Dean, you might also like..."
You've just founded an incredible new startup and need an equally incredible engineering team. Your top choice just took a job at Google. Recommend awesome alternatives to the one that got away. Find target profiles that best match a source profile along dimensions that matter to hiring.
-
"Angular? Ember? Backbone? What should I learn next?"
Often, listed keywords are poor indicators of developer skills and competence. For example, someone who lists "Flask" and "Jinja" surely knows "Python" and almost certainly knows "Django". Build a system to automatically learn an ontology of skills along with associated values. Rank profiles based on the likelihood they would be a match for a certain job requisition. Given a resume, return the best-matching profiles, even if none of the keywords on the two documents match.
-
Gild Dataset 2
-
Content. This data set consists of search queries run by recruiters looking for developers to hire. Each data point consists of an individual search run by a recruiter. An example data point:
- Timestamp: February 18, 2008 14:23
- Recruiter_id: 34935
- Location: San Francisco
- Skills: Java, jQuery
- Name: Left Blank
- Score: Left Blank
- Company: Left Blank
Not all fields will be filled on every search.
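Two starter aggregations over such records, sketched with hypothetical field names and invented values: how many searches each recruiter ran, and how often each optional field is actually filled in:

```python
from collections import Counter

# Made-up search records following the example fields above.
searches = [
    {"recruiter_id": 34935, "location": "San Francisco", "skills": ["Java", "jQuery"]},
    {"recruiter_id": 34935, "skills": ["Java"]},
    {"recruiter_id": 10001, "location": "New York", "company": "Acme"},
]

# Searches per recruiter (a proxy for continued product usage).
searches_per_recruiter = Counter(s["recruiter_id"] for s in searches)

# How often each optional field is filled in across all searches.
field_usage = Counter(f for s in searches for f in s if f != "recruiter_id")

print(searches_per_recruiter.most_common())
print(field_usage.most_common())
```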
-
Projects.
-
"Pre-crime."
Behavioral analysis of users is crucial not just for UX improvements but also to identify optimal, early intervention points to "rescue" them from future problems. Given our search data, predict future usage patterns in the search data from current ones. Identify from a few initial searches which users are likely to continue using the product. Identify differences in early usage patterns (e.g., growth in search competence as indicated by number of returns) which might predict intervention points for customer ("student") contact.
-
"Bayesian Diagnosis"
Running full user testing experiments is costly and time consuming. Ideally (if rarely), analyzing user behavior can be done directly from well-designed logging. You do not have that here :) From this dataset, identify which search fields are most important to the search experience and which fields are causing issues.
-
Economics Course Online Discussion Posts
-
Content. This dataset contains 42,688 message posts from online discussions in several sessions of an introduction-to-economics course. The course is six weeks long and requires two posts per week in response to instructor-posted questions. There are two separate tables. The first contains the messages:
- id: a unique identifier for the message author
- week: the week of the six week course
- Message Body: the text of the message
- Thread Id: a unique identifier for a discussion thread, a group of responses to a single question
- Class Id: a unique identifier for each class of the course
The second contains the grade information:
- [id]: each student's unique identifier is a key
- grades: the value for each id-key contains cumulative grade information
Here is an example grade entry:
{u'9024614365':
{
u'Week 1': {u'WK1': 3.0, u'WK1_max': 5.0},
u'Week 2': {u'WK2': 8.0, u'WK2_max': 10.0},
u'Week 3': {u'WK3': 10.0, u'WK3_max': 10.0},
u'Week 4': {u'WK4': 10.0, u'WK4_max': 10.0},
u'Week 5': {u'WK5': 18.0, u'WK5_max': 20.0},
u'Week 6': {u'WK6': 17.0, u'WK6_max': 20.0},
u'cum_score': 88.0,
u'cum_score_total': 100.0
}
}
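A small sketch that computes per-week score fractions and the overall percentage from an entry shaped like the one above (note: the weekly maxima sum to 75 while cum_score_total is 100, so the cumulative score apparently includes components beyond the weekly posts):

```python
# The example grade entry, as loaded (u'' literals are plain str in Python 3).
grades = {
    '9024614365': {
        'Week 1': {'WK1': 3.0, 'WK1_max': 5.0},
        'Week 2': {'WK2': 8.0, 'WK2_max': 10.0},
        'Week 3': {'WK3': 10.0, 'WK3_max': 10.0},
        'Week 4': {'WK4': 10.0, 'WK4_max': 10.0},
        'Week 5': {'WK5': 18.0, 'WK5_max': 20.0},
        'Week 6': {'WK6': 17.0, 'WK6_max': 20.0},
        'cum_score': 88.0,
        'cum_score_total': 100.0,
    }
}

def weekly_fractions(entry):
    """Map each 'Week N' key to its score fraction, e.g. Week 1 -> 3.0/5.0."""
    out = {}
    for week, scores in entry.items():
        if not week.startswith('Week'):
            continue                        # skip cum_score / cum_score_total
        key = 'WK' + week.split()[-1]       # 'Week 1' -> 'WK1'
        out[week] = scores[key] / scores[key + '_max']
    return out

entry = grades['9024614365']
print(weekly_fractions(entry))
print(entry['cum_score'] / entry['cum_score_total'])  # 0.88
```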
-
Projects.
-
"Save the little children."
Predict which students are in danger of failing the class and recommend them for early intervention.
-
"Don't trust experts."
Build a knowledge map/ontology of economics from the free-form student discussions. Label the map/ontology by normativity to distinguish common misconceptions from legitimate economics concepts.
Let us know if you need more info on these datasets. We will upload the datasets to EC2.
Other datasets
- DBpedia. Richly labeled network containing data extracted
from Wikipedia (based on infoboxes); a labeled network with multiple
types of nodes and edges.
About 2.6 million concepts described by 247 million triples, including
abstracts in 14 different languages. http://dbpedia.org. Some project ideas:
- Detecting missing links (and relation types)
- Classification of nodes into the ontology.
Other OpenLinkedData datasets available at http://esw.w3.org/DataSetRDFDumps.
- Yannis
Antonellis and Jawed Karim offer a file that contains information about
the search queries that were used to reach pages on the Stanford Web
server. See http://www.stanford.edu/~antonell/tags_dataset.html
- SNAP network datasets. 60 large social and information network datasets
- Wikipedia
- Ratings and purchases (movies, music, etc.)
- Yahoo! Webscope Catalog of datasets
- Yahoo! Webscope dataset collection. Contains Language Data, Graph and Social Data, Ratings Data, Advertising and Market Data, Competition Data
- Note:
Jure Leskovec will have to apply for any sets you want, and we must
agree not to distribute them further. There may be a delay, so get
requests in early.
- 1.usa.gov data set
It would enable questions around link propagation, half-life by
referrer, geographical analysis, and I'm sure a ton of other fun
stuff!
1.usa.gov