Data Geek Challenge: Analyzing Amazon Product Reviews with SAP Lumira
I am not a crazy online shopper like some of my friends, but I find it very useful and interesting to read the customer product reviews posted on online retail sites like Amazon.com. One time on a bus heading home after work, a lady sitting next to me was reading “To the Lighthouse” by Virginia Woolf, which immediately aroused my curiosity. To see whether it could be my next vacation book, I went to Amazon.com to find out what other people said about it. Amazon.com does a good job of collecting customer reviews on products, but in terms of analyzing and presenting those reviews, it didn’t give me much more than a simple “star” column chart and a couple of the most helpful favorable and critical reviews, followed by a list of all the reviews (see the figure below).
For products that come with hundreds or thousands of reviews, it is just too difficult for a person to read through and analyze them all. Data discovery tools like SAP Lumira can be very helpful for this kind of insight hunting.
From a business point of view, not only end consumers but also product designers and product sellers are very interested in gaining deeper insights from customer product reviews.
Two data sources are used for this blog post (although I collected a bunch of customer product reviews from Amazon.com):
- Product Reviews: To the Lighthouse: Amazon.com
- Amazon.com: Customer Reviews: Happy Camper Two Person Tent With Carry Bag
The figure below shows their data volumes:
I faced three challenges on my tour of using SAP Lumira to analyze Amazon customer product reviews.
- Data extraction
My first challenge was to extract the data from the Amazon.com web site into a format that Lumira could read. After some online research, I could not find a straightforward public API for retrieving the review data. To avoid spending too much time on this, I ended up writing a small VBA (Visual Basic for Applications) script inside MS Excel that automates Internet Explorer to fetch the web pages and parses the review data directly into Excel sheets, which can easily be fed into Lumira. The figure below shows the simple front-end GUI of the script:
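The parsing step can be illustrated with a minimal Python sketch. This is not the VBA script I used, and it does not touch Amazon.com: the markup in `SAMPLE_PAGE` and the class names `review`, `rating`, `title`, and `text` are hypothetical placeholders, since the real review pages are structured differently and change over time. It also writes CSV text rather than Excel sheets, simply to stay within the standard library.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup, for illustration only. The real review pages use
# different tags and class names.
SAMPLE_PAGE = """
<div class="review">
  <span class="rating">5</span>
  <span class="title">A classic</span>
  <div class="text">Beautiful prose, slow but rewarding.</div>
</div>
<div class="review">
  <span class="rating">2</span>
  <span class="title">Not for me</span>
  <div class="text">I could not follow the story.</div>
</div>
"""

FIELDS = ("rating", "title", "text")


class ReviewParser(HTMLParser):
    """Collects one dict per <div class="review"> block."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._field = None  # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "review":
            # Start a new, empty review record.
            self.reviews.append(dict.fromkeys(FIELDS, ""))
        elif css_class in FIELDS and self.reviews:
            self._field = css_class

    def handle_endtag(self, tag):
        # Fields contain no nested tags in this sketch, so any end tag
        # closes the field being read.
        self._field = None

    def handle_data(self, data):
        if self._field:
            self.reviews[-1][self._field] += data.strip()


def reviews_to_csv(html):
    """Parse the page and return the reviews as CSV text with a header row."""
    parser = ReviewParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.reviews)
    return out.getvalue()


csv_text = reviews_to_csv(SAMPLE_PAGE)
print(csv_text)
```

Each review becomes one row with a rating, a title, and the review text, which is the same row-per-review shape that the Excel sheets had.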
- Sentiment analysis
To my knowledge, at the time this blog post was written, Lumira provided little sentiment analysis capability. Again, to keep it simple, I ended up writing another small VBA script that adds a lexical-level sentiment analysis algorithm based on the research paper “Language-independent Bayesian sentiment mining of Twitter” by Alex Davies and Zoubin Ghahramani. So that Lumira could analyze the lexical sentiment, I created a second dataset by breaking the review dataset down into rows of words, and then merged it (joined it back) with the review dataset. This operation sometimes results in a large dataset, which led to my next challenge.
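The break-down-and-join-back step can be sketched in Python as follows. This is an illustration under stated assumptions, not my VBA script and not the algorithm from the paper: the sample rows are made up, the star rating is used as the label (4 to 5 stars counted as positive, 1 to 2 stars as negative), and the word score is a simple smoothed share of positive occurrences, chosen only because it yields values between 0 and 1 with 0.5 as neutral.

```python
import re
from collections import Counter

# Made-up sample rows. The real review dataset has more columns.
reviews = [
    {"review_id": 1, "stars": 5, "text": "Easy to set up, very good tent"},
    {"review_id": 2, "stars": 1, "text": "Zipper broke, tent leaks"},
    {"review_id": 3, "stars": 4, "text": "Good tent, easy to carry"},
]


def tokenize(text):
    """Upper-case the text and split it into words (apostrophes allowed)."""
    return re.findall(r"[A-Z]+(?:'[A-Z]+)?", text.upper())


# Step 1: break the reviews down into one row per word occurrence.
word_rows = [
    {"review_id": r["review_id"], "word": w}
    for r in reviews
    for w in tokenize(r["text"])
]

# Step 2: count word occurrences in positive and negative reviews.
# Assumption: 4-5 stars = positive, 1-2 stars = negative, 3 stars ignored.
pos, neg = Counter(), Counter()
for r in reviews:
    words = tokenize(r["text"])
    if r["stars"] >= 4:
        pos.update(words)
    elif r["stars"] <= 2:
        neg.update(words)


def word_sentiment(word):
    """Smoothed share of the word's occurrences in positive reviews.

    Add-one smoothing keeps unseen words at exactly 0.5 (neutral).
    """
    p, n = pos[word], neg[word]
    return (p + 1) / (p + n + 2)


# Step 3: join the word rows back to their reviews and attach the score.
by_id = {r["review_id"]: r for r in reviews}
joined = [
    {
        **row,
        "stars": by_id[row["review_id"]]["stars"],
        "sentiment": word_sentiment(row["word"]),
    }
    for row in word_rows
]

print(len(reviews), "reviews ->", len(joined), "word rows")
```

Even in this toy sample, three reviews turn into sixteen word rows, which hints at why the joined dataset grows so quickly for a product with many long reviews.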
- Large dataset
The Lumira Desktop I used is the free download version 1.13.0, which is limited in how well it handles large datasets. When performing sentiment analysis on “To the Lighthouse”, I ran into a performance bottleneck that I could not easily overcome. To continue my insight-hunting journey, I had to pick another data source, the customer reviews of “Happy Camper Two Person Tent With Carry Bag”. It is a smaller dataset, but it is good enough to illustrate some of the potential insights that could be found in “To the Lighthouse”, in any other reviews, or in other text content.
Result Part I: Interesting Insights for “To the Lighthouse”
- The majority of the reviews came from Paperback. Wow, Ebook still hasn’t caught up.
- Paperback reviewers share the same pattern with Ebook reviewers, but Audio reviewers have a quite different pattern (less happy).
- Top 100/500/1000 Reviewers all gave 5-star ratings. Note: a badge is a symbol that Amazon.com gives to reviewers who earn top-reviewer status by writing good-quality reviews.
- Most of the U.S. reviewers came from the four corners of the country. Do these states have larger populations, or do their residents have more time to read a novel?
- Reviewers near oceans seem to be happier with the book than those far from water. Interesting! Could this be related to the fact that the novel's story is all about visits to a Scottish island?
- West coast reviewers seem to be happier with the book than those on the east coast. Why?
- Instead of reading all the reviews, I only want to read the top 5 or 10 reviews (the top 1 alone is too few):
Result Part II: Interesting Insights for “Happy Camper Two Person Tent With Carry Bag”
- The most frequently used word in all reviews is “TENT”, which is no surprise as we are looking at a tent. Next come “SO, USE, SMALL, GOOD, ZIPPER, HAVE, IF, ONE, SET, VERY, EASY, CAMPING”, etc.
- The number of occurrences of “TENT” is three times that of the next word.
- In the figure below I rendered the most positive (happy) words in green and the most negative (sad) words in red. It turns out that happy words far outnumber sad words in the reviews.
- In the center there is a cluster of neutral words surrounded by some happy words. Sad words are scattered toward the edges. This figure suggests that, on average, customers are more likely to have a positive feeling about the product.
- The figure below indicates the same. Sad words (sentiment < 0.5) are on the left side. Happy words (sentiment > 0.5) are on the right side. Neutral words (sentiment = 0.5) are in between. The high peak in red is the word “TENT”, which falls in the neutral area.
- Filter only the most positive words:
- Filter only the most negative words:
- If anyone wants to summarize the reviews of the product, the words in the leftmost column are good candidates. A summary could then be something like “a very good and easy to use tent for camping”.
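The word counts and the happy/neutral/sad split behind these charts can be sketched in a few lines of Python. The word list and the sentiment scores below are made up for illustration and are not the values from my dataset. Only the thresholds (below 0.5 is sad, above 0.5 is happy, exactly 0.5 is neutral) follow the description above.

```python
from collections import Counter

# Made-up word occurrences and scores, for illustration only.
words = ["TENT", "TENT", "TENT", "GOOD", "EASY", "ZIPPER", "GOOD", "LEAKS"]
sentiment = {"TENT": 0.5, "GOOD": 0.9, "EASY": 0.8, "ZIPPER": 0.3, "LEAKS": 0.1}

# Word frequencies, most frequent first.
freq = Counter(words)
top_two = freq.most_common(2)


def bucket(score):
    """Classify a score with the same thresholds as the charts."""
    if score > 0.5:
        return "happy"
    if score < 0.5:
        return "sad"
    return "neutral"


# Group the distinct words by sentiment bucket.
buckets = {"happy": [], "neutral": [], "sad": []}
for word in freq:
    buckets[bucket(sentiment[word])].append(word)

# "Filter only the most positive / most negative words".
most_positive = sorted(freq, key=lambda w: sentiment[w], reverse=True)[:2]
most_negative = sorted(freq, key=lambda w: sentiment[w])[:2]

print(top_two)
print(buckets)
print(most_positive, most_negative)
```

In Lumira the same grouping is done with filters and color rules on the sentiment measure, so the sketch is only meant to make the thresholds concrete.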
SAP Lumira did a good job of helping me see, imagine, and show the review data I got from Amazon.com, although in some areas it could do more to reduce my challenges. It is a good sign that extensibility and big dataset support are surfacing on the Lumira roadmap. So it is probably a good time to write a blog post here in response to the Data Geek Challenge, as a stop on the trip before the busy pre-Christmas season arrives. However, my journey to seek insights with the help of SAP Lumira is far from its destination, especially given how much potential there is in areas such as sentiment analysis at higher levels.
I have also attached the datasets used in this blog post (the review data I collected from Amazon.com plus the calculated sentiments) for those who are interested in following along. To use them, remove the .txt file extension after unzipping.