Amazon product data
Julian McAuley, UCSD
This dataset contains product reviews and metadata from Amazon, including 143.7 million reviews spanning May 1996 - July 2014.
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
Complete review data
Please see the per-category files below, and only download these (large!) files if you absolutely need them:
raw review data (20gb) - all 143.7 million reviews
The above file contains some duplicate reviews, mainly due to near-identical products whose reviews Amazon merges, e.g. VHS and DVD versions of the same movie. These duplicates have been removed in the two files below:
user review data (18gb) - duplicate items removed (83.31 million reviews), sorted by user
product review data (19gb) - duplicate items removed, sorted by product
Finally, the following file deduplicates more aggressively, removing duplicate reviews even when they are written by different users. This accounts for users with multiple accounts and for plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment-analysis tasks.
aggressively deduplicated data (18gb) - no duplicates whatsoever (83.08 million reviews)
Format is one-review-per-line in (loose) json. See files below for further help reading the data.
- reviewerID - ID of the reviewer, e.g. A1RSDE90N6RSZF
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- helpful - helpfulness rating of the review, e.g. 2/3 (2 of 3 votes found the review helpful)
- reviewText - text of the review
- overall - rating of the product
- summary - summary of the review
- unixReviewTime - time of the review (unix time)
- reviewTime - time of the review (raw)
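The fields above can be turned into more convenient values once a record has been loaded as a dictionary. The sketch below is illustrative only: the helper names `helpful_fraction` and `review_datetime` are hypothetical, and it assumes the helpfulness value is stored either as a two-element list such as [2, 3] or as a string such as "2/3", and that unixReviewTime is seconds since the epoch.

```python
from datetime import datetime, timezone


def helpful_fraction(helpful):
    """Return (helpful_votes, total_votes, ratio); ratio is None with no votes.

    Assumption: `helpful` is either a pair like [2, 3] or a string like "2/3".
    """
    if isinstance(helpful, str):
        up, total = (int(part) for part in helpful.split("/"))
    else:
        up, total = helpful
    ratio = up / total if total else None
    return up, total, ratio


def review_datetime(review):
    """Convert the unixReviewTime field (seconds since epoch) to a UTC datetime."""
    return datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc)
```

Reviews with no votes yield a ratio of None rather than a division error, so they can be filtered out before computing helpfulness statistics.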
Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:
metadata (1.9gb) - metadata for 9.4 million products
- asin - ID of the product, e.g. 0000031852
- title - name of the product
- price - price in US dollars (at time of crawl)
- imUrl - url of the product image
- related - related products (also bought, also viewed, bought together, buy after viewing)
- salesRank - sales rank information
- brand - brand name
- categories - list of categories the product belongs to
We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.
visual features (141gb) - visual features for all products
Below are files for individual product categories, which have already had duplicate item reviews removed.
Please cite the following if you use the data in any way:
Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
Reading the data
Data can be treated as python dictionary objects. A simple script to read any of the above data is as follows:
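A minimal sketch of such a reader, assuming the files are gzip-compressed with one python dictionary literal per line. The function name `parse` and the file name in the usage comment are illustrative. It uses `ast.literal_eval` in place of `eval`, a substitution made here so that a malformed or malicious line cannot execute code.

```python
import ast
import gzip


def parse(path):
    """Yield one review (or metadata) record per line as a python dict."""
    with gzip.open(path, "rt", encoding="utf-8") as g:
        for line in g:
            line = line.strip()
            if line:  # skip blank lines
                # literal_eval accepts python literals only, unlike eval
                yield ast.literal_eval(line)


# Usage (hypothetical file name):
# for review in parse("reviews_Video_Games.json.gz"):
#     print(review["asin"], review["overall"])
```

Because `parse` is a generator, records are read one at a time and even the larger files do not need to fit in memory.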
Convert to 'strict' json
The above data can be read with python 'eval', but is not strict json. If you'd like to use some language other than python, you can convert the data to strict json as follows:
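One way to do the conversion is to read each line as a python literal and write it back out with the standard `json` module. This is a sketch under the same assumptions as the format description above (gzip-compressed input, one record per line); the function name `to_strict_json` and the file names are illustrative, and `ast.literal_eval` is used here as a safer stand-in for `eval`.

```python
import ast
import gzip
import json


def to_strict_json(in_path, out_path):
    """Rewrite a loose-json .gz file as strict json, one record per line.

    Returns the number of records written.
    """
    count = 0
    with gzip.open(in_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            record = ast.literal_eval(line)       # python dict literal -> dict
            dst.write(json.dumps(record) + "\n")  # dict -> strict json line
            count += 1
    return count


# Usage (hypothetical file names):
# to_strict_json("reviews_Video_Games.json.gz", "output.strict")
```

The output is plain text with one json object per line, which most languages can read with their standard json parser.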
Read image features
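The binary layout described earlier (a 10-character product ID followed by 4096 floats, repeated for every product) can be read record by record. The sketch below assumes the floats are 4-byte values in the machine's native byte order, which the description does not state explicitly; the function name `read_image_features` and its `dim` parameter are illustrative.

```python
import array


def read_image_features(path, dim=4096):
    """Yield (asin, features) pairs from the binary image-feature file.

    Assumption: each record is 10 ASCII bytes (the product ID) followed by
    `dim` 4-byte floats in native byte order.
    """
    with open(path, "rb") as f:
        while True:
            asin = f.read(10)
            if len(asin) < 10:  # end of file (or a truncated trailing record)
                break
            feats = array.array("f")
            feats.fromfile(f, dim)  # raises EOFError if the record is cut short
            yield asin.decode("ascii"), feats.tolist()


# Usage (hypothetical file name):
# for asin, feats in read_image_features("image_features.b"):
#     print(asin, len(feats))
```

Since the full feature file is very large, iterating lazily like this avoids loading all products into memory at once.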
Example: compute average rating
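A minimal sketch of this computation, assuming a gzip-compressed review file with one python dictionary literal per line and a numeric `overall` field as described above. The function name `average_rating` and the file name in the usage comment are illustrative.

```python
import ast
import gzip


def average_rating(path):
    """Return the mean of the 'overall' field, or None if the file is empty."""
    total = 0.0
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as g:
        for line in g:
            line = line.strip()
            if not line:
                continue
            review = ast.literal_eval(line)  # one dict literal per line
            total += float(review["overall"])
            count += 1
    return total / count if count else None


# Usage (hypothetical file name):
# print(average_rating("reviews_Video_Games.json.gz"))
```

The running sum and count keep memory use constant regardless of file size.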