Data Geek Challenge III – Analyzing a Novel Dataset
SAP Lumira is a nifty tool for data exploration released by SAP, and as always, they do not fail to impress. A few months ago, I had the chance to work with it, albeit only briefly, and now I have a great opportunity (and excuse) to look into it again: the Data Geek Challenge III.
Ever since I was a kid, I’ve loved reading, although I mostly stick to fiction. Hence, I decided to analyze a ‘novel’ dataset of widely acclaimed books. In 2003, the British Broadcasting Corporation (BBC) conducted a survey to find the most popular novels and came up with a list. I have used the top 50 of those to create my dataset. For the entire list (which makes for pretty excellent reading), feel free to visit this link: http://www.bbc.co.uk/arts/bigread/top100.shtml
After much consideration, I chose to pledge my allegiance to the House of Titans, as I firmly believe that a good book is the best source of entertainment there is. And now, on to my story.
Having only a list of 50 books and their respective authors, I first had to create my dataset. I thought of factors that might come into play while discussing bestsellers and came up with a few parameters, and set out to collect the information. Sad to say, I was not able to find all that I had hoped to, and my dataset is still missing some data for a few books.
Next, I opened Lumira and loaded the Excel file in which I’d stored my data. In the Prepare tab, the tool gives you the option of cleaning and modifying your data for a meaningful analysis, which is very useful, especially when you want to make changes to the data without going back to the source file.
After preparing the data came the best part: visualizing! In the Visualize tab, there is a plethora of charts to choose from, along with options for mapping your data to them. I set about creating visualizations, hoping to find answers along the way.
My first chart depicts the distribution of these titles by language. I’ve used a pie chart, and the results are not surprising, with the English language reigning supreme.
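For anyone curious about what sits behind a pie chart like this, it boils down to counting titles per category and turning the counts into percentages. Here is a minimal Python sketch of that idea. The field names, the `share_by` helper and the three sample rows are made-up placeholders for illustration, not rows from my actual dataset, and Lumira handles all of this through its interface without any code.

```python
from collections import Counter

# Hypothetical placeholder rows; the real dataset has 50 titles.
books = [
    {"title": "Book A", "language": "English"},
    {"title": "Book B", "language": "English"},
    {"title": "Book C", "language": "French"},
]


def share_by(records, field):
    """Return {category: percentage of records} for the given field.

    Records where the field is missing or empty are skipped.
    """
    counts = Counter(r[field] for r in records if r.get(field))
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


shares = share_by(books, "language")

# Print the largest slice first, as a pie chart legend usually does.
for language, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{language}: {pct:.1f}%")
```

The same helper would work for any other categorical column, such as genre or the author's nationality, by passing a different field name.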
When it comes to genre, people seem to favor fantasy over others.
Sorting the books by year of publication shows, unsurprisingly, that most of them are good old classics.
Let us now look at the authors, the people behind such wondrous works of art. I’ve split them based on their nationality and gender.
Most authors are British, with Americans coming in at a distant second. I’ve used a bar chart to visualize this information, but thanks to Lumira, I can use a more insightful visualization: the choropleth map.
Splitting the titles by the author’s gender shows that male authors are featured more on the list than their female counterparts.
Although the BBC had proclaimed these 50 novels to be the most popular, I wanted to see for myself if they were all that great. I checked their ratings (obtained from goodreads.com) and the results come as no surprise, with all titles having fairly high ratings.
Most of them have been, and still are, best-sellers, having sold millions of copies and been translated into various languages. Refer to the Tag Cloud below, with The Count of Monte Cristo at 200 million copies sold.
It is only natural that a good book is made into a movie, and the bulk of these novels have been featured on the silver screen.
However, not all books turn into great movies. Some of them do not perform as well at the box office, while some others go on to break records.
In some cases, a higher-rated book has also grossed more at the box office. Check out this scatter plot, where The Lord of the Rings is an outlier, with almost 3 billion dollars in revenue and a current rating of 4.43 out of 5.
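One way to put a number on a scatter plot like this is a correlation coefficient between rating and box-office gross. The sketch below computes Pearson's r in plain Python, skipping titles with missing values. The rows, the units and the `pearson` helper are hypothetical placeholders for illustration, not figures from my dataset, and this is not something I computed in Lumira.

```python
import math

# Hypothetical placeholder rows: (title, rating out of 5, gross in millions of dollars).
# None marks a value that could not be found.
books = [
    ("Book A", 4.4, 2900.0),
    ("Book B", 4.0, 300.0),
    ("Book C", 3.8, None),
    ("Book D", 3.9, 150.0),
]


def pearson(pairs):
    """Return Pearson's correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Keep only the titles where both the rating and the gross are known.
complete = [
    (rating, gross)
    for _, rating, gross in books
    if rating is not None and gross is not None
]

r = pearson(complete)
print(f"Pearson r over {len(complete)} titles: {r:.2f}")
```

With so few points and one large outlier, a coefficient like this is only a rough hint, and it says nothing about whether a good rating causes a higher gross.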
With that, I wrap up my analysis. Lumira is truly a fun and easy-to-use tool, and many thanks to SAP for enabling users to explore, visualize and share their findings in such a simple manner. I look forward to working more with Lumira, and to submitting a much better entry in next year’s edition of the Data Geek Challenge!
P.S.
I’ve read only 17 books on this list. On the bright side, I still have 33 more great novels to read. How many of them have you read?