Amazon product data

Julian McAuley, UCSD

Description

This dataset contains product reviews and metadata from Amazon, including 143.7 million reviews spanning May 1996 - July 2014.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

Files

Complete review data

Please see the per-category files below, and only download these (large!) files if you absolutely need them:

raw review data (20gb) - all 143.7 million reviews

The above file contains some duplicate reviews, mainly due to near-identical products whose reviews Amazon merges, e.g. VHS and DVD versions of the same movie. These duplicates have been removed in the two files below:

user review data (18gb) - duplicate items removed (83.31 million reviews), sorted by user

product review data (19gb) - duplicate items removed, sorted by product

Finally, the following file removes duplicates more aggressively, removing duplicates even if they are written by different users. This accounts for users with multiple accounts or plagiarized reviews. Such duplicates account for less than 1 percent of reviews, though this dataset is probably preferable for sentiment analysis type tasks.

aggressively deduplicated data (18gb) - no duplicates whatsoever (83.08 million reviews)

Format is one-review-per-line in (loose) json. See files below for further help reading the data.

Sample review:

{ "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }

where

Metadata

Metadata includes descriptions, price, sales-rank, brand info, and co-purchasing links:

metadata (1.9gb) - metadata for 9.4 million products

Sample metadata:

{ "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17, "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] }

where

Visual Features

We extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). See files below for further help reading the data.

visual features (141gb) - visual features for all products

Per-category files

Below are files for individual product categories, which have already had duplicate item reviews removed.

Books reviews metadata image features
Electronics reviews metadata image features
Movies and TV reviews metadata image features
CDs and Vinyl reviews metadata image features
Clothing, Shoes and Jewelry reviews metadata image features
Home and Kitchen reviews metadata image features
Kindle Store reviews metadata image features
Sports and Outdoors reviews metadata image features
Cell Phones and Accessories reviews metadata image features
Health and Personal Care reviews metadata image features
Toys and Games reviews metadata image features
Video Games reviews metadata image features
Tools and Home Improvement reviews metadata image features
Beauty reviews metadata image features
Apps for Android reviews metadata image features
Office Products reviews metadata image features
Pet Supplies reviews metadata image features
Automotive reviews metadata image features
Grocery and Gourmet Food reviews metadata image features
Patio, Lawn and Garden reviews metadata image features
Baby reviews metadata image features
Digital Music reviews metadata image features
Musical Instruments reviews metadata image features
Amazon Instant Video reviews metadata image features

Citation

Please cite the following if you use the data in any way:

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
draft

Code

Reading the data

Data can be treated as python dictionary objects. A simple script to read any of the above the data is as follows:

def parse(path): g = gzip.open(path, 'r') for l in g: yield eval(l)

Convert to 'strict' json

The above data can be read with python 'eval', but is not strict json. If you'd like to use some language other than python, you can convert the data to strict json as follows:

import json def parse(path): g = gzip.open(path, 'r') for l in g: yield json.dumps(eval(l)) f = open("output.strict") for l in parse("reviews_Video_Games.json.gz"): f.write(l + '\n')

Read image features

import struct def readImageFeatures(path): f = open(path, 'rb') while True: asin = f.read(10) if asin == '': break feature = [] for i in range(4096): feature.append(struct.unpack('f', f.read(4))) yield asin, feature

Example: compute average rating

ratings = [] for review in parse("reviews_Video_Games.json.gz"): ratings.append(review['overall']) print sum(ratings) / len(ratings)