DG2: Data Geek Challenge 2
- @SCNBlogs twitter timeline dataset
- The best day of the week to appear on @SCNBlogs timeline
- The best day of the week to publish an SCN blog
- Where are the tweeps clicking on the @SCNblogs tweets
- SCN Blogger with the most retweets and favourites on the @SCNBlogs timeline
- SCN Blogger with the most favourited tweet on @SCNblogs
- SCN Blogger with the most @SCNblogs page views
- @SCNBlogs blogs with the most page views
- A place for bloggers
- Thank you
- Data Quality
- One last thing to do
After entering the first Data Geek Challenge with an entry based on SDN Points and learning some things along the way, I thought I would enter the DG2 contest, this time using data based on SCN. My chosen data set was created from the @SCNblogs twitter feed. The intention was to identify the best day/time to publish a blog and the best day/time to tweet about it. Searching Google for “best time to tweet” and “best time to blog” returns lots of results with suggestions to improve “click through ratio”. The following shows my attempts to use my sample data from the @SCNblogs timeline to determine whether I could get the answers from clicks on links and views of blogs. One item to note: this is based purely on which blogs appear on the @SCNblogs twitter timeline and not the main SCN blog site. Although the @SCNblogs twitter account is automated, some blogs do not appear on the timeline. I will start with a short description of the process to gather the data for my work with Lumira.
@SCNBlogs twitter timeline dataset
For the Data Geek 1 contest, I used the data from the old SDN RSS feeds, which I had collected via an ABAP report (and blogged about here). This time out I thought I would again collect the data via SAP software. My intention was to learn more about a variety of SAP software in the process of entering the Data Geek 2 contest with Lumira. I used SQL Anywhere to read the @SCNblogs twitter timeline through the twitter and bitly APIs, and SAPUI5 to format 3 months' worth of data into a table. I also picked up some OAuth knowledge along the way. This data was copied into Excel for formatting and then used in Lumira. The @SCNblogs tweets usually contain a bit.ly url and appear as follows.
Example @SCNblogs tweet.
BusinessObjects Design Studio Question and Answer – ASUG
I have highlighted the bit.ly link in the above text. An example of the statistics bitly provides can be found here. I combined the two using SQL Anywhere and SAPUI5 using the twitter & bitly APIs. I used the bitly api http://dev.bitly.com/link_metrics.html#v3_link_referring_domains to calculate the total number of clicks and those clicks where twitter was the source of the click. Looks like this in SAPUI5.
Those are the basics of the data collection process, and I can share my CSV file if you would like to take a look.
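To make the click calculation concrete, here is a minimal Python sketch of how total clicks and twitter-sourced clicks could be derived from a referring-domains style response. This is not the SQL Anywhere code I used. The payload shape (a `data.referring_domains` list of `domain`/`clicks` entries), the sample numbers, and the choice to treat both `twitter.com` and `t.co` as twitter are all assumptions for illustration.

```python
from typing import Dict, Tuple

# Assumption: clicks arriving via Twitter's own site or its t.co link wrapper
# are both counted as "twitter" clicks.
TWITTER_DOMAINS = {"twitter.com", "t.co"}


def click_totals(response: Dict) -> Tuple[int, int]:
    """Return (total_clicks, twitter_clicks) for one shortened link.

    `response` is assumed to look like:
    {"data": {"referring_domains": [{"domain": "...", "clicks": N}, ...]}}
    """
    domains = response.get("data", {}).get("referring_domains", [])
    total = sum(entry.get("clicks", 0) for entry in domains)
    from_twitter = sum(
        entry.get("clicks", 0)
        for entry in domains
        if entry.get("domain") in TWITTER_DOMAINS
    )
    return total, from_twitter


# Made-up sample payload, not real @SCNblogs data.
sample = {
    "data": {
        "referring_domains": [
            {"domain": "t.co", "clicks": 12},
            {"domain": "direct", "clicks": 5},
            {"domain": "twitter.com", "clicks": 3},
        ]
    }
}

print(click_totals(sample))  # (20, 15)
```

A link that cannot be resolved would simply produce an empty list here, which gives a click count of zero.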
Now onto how I used Lumira to answer my questions, “best day/time to publish a blog on SCN?” and “what time to tweet about that blog?” I also put some names to the blogs @SCNblogs tweet about using the bit.ly api.
The best day of the week to appear on @SCNBlogs timeline
The following is a chart of clicks on the bitly links via twitter by day.
The result was closer than I expected, and I was slightly surprised to see Friday being the top day. However, my sample data over the last 3 months indicates that tweets appearing on a Friday have the most clicks on the bitly link from twitter. The next step was to filter on Friday and check the best time of day to get those twitter clicks.
Again I was slightly surprised to see 14:00 being the peak time. The chart is not showing the actual time of the click but the time the tweet was published. According to the bitly blog http://blog.bitly.com/post/9887686919/you-just-shared-a-link-how-long-will-people-pay the “half life” of a general bitly link is 3 hours. So 14:00-17:00 is the peak time for clicks on the link.
That said, what is a typical day for people on SCN? It is worldwide and covers many timezones. The twitter API returns “time_zone”: “Berlin” with the timeline data.
Therefore I have my time to tweet: Friday 14:00.
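The day and hour breakdown above was done in Lumira, but the same grouping can be sketched in a few lines of Python. The rows below are made up, the `created_at` strings assume the classic twitter timestamp format, and the times are kept in UTC rather than converted to any local timezone.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp format of the tweet's created_at field,
# e.g. "Fri Jul 05 14:03:11 +0000 2013".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def clicks_by_day_and_hour(rows):
    """Sum twitter clicks per weekday and per (weekday, hour) of tweet time.

    `rows` is an iterable of (created_at, twitter_clicks) pairs.
    """
    by_day = Counter()
    by_day_hour = Counter()
    for created_at, clicks in rows:
        posted = datetime.strptime(created_at, CREATED_AT_FORMAT)
        day = posted.strftime("%A")
        by_day[day] += clicks
        by_day_hour[(day, posted.hour)] += clicks
    return by_day, by_day_hour


# Made-up sample rows, not the real dataset.
rows = [
    ("Fri Jul 05 14:03:11 +0000 2013", 30),
    ("Fri Jul 12 14:40:00 +0000 2013", 25),
    ("Fri Jul 12 09:15:00 +0000 2013", 10),
    ("Wed Jul 03 13:00:00 +0000 2013", 20),
]

by_day, by_day_hour = clicks_by_day_and_hour(rows)
print(by_day.most_common(1))       # [('Friday', 65)]
print(by_day_hour[("Friday", 14)])  # 55
```

As in the charts, the hour is the hour the tweet was published, not the hour of the click.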
The best day of the week to publish an SCN blog
The view count of the SCN blogs (via @SCNblogs) will be used for this next section. My initial collection of the @SCNblogs twitter timeline left me with the SCN bit.ly URL and text, which was what I was after, but I thought more detail was required. Therefore I used SQL Anywhere to expand the bitly link to its original state with the bitly api, and then I was able to scrape out the user and view data of the source blogs. The assumption here is that the creation date is the published date, i.e. a draft blog will be updated with the published date/time as the creation date. I will find out when my blog hits the SCN blog space.
Blogs created (published) on a Wednesday have the most views over the last 3 months. So again the next step is to drill down into Wednesday to find the best time of day.
The time format was in AM/PM from the SCN blogs Jive site. So although “9” has the highest overall views, this is for AM and PM combined. Once split, 1pm is the best time (of publication) for SCN blog views over my dataset.
So I have my time to publish a blog: Wednesday 13:00
Again the data is for the time the blog was published and not the time it was viewed.
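The AM/PM issue mentioned above can be avoided by converting the publication hour to a 24-hour value before grouping. This is a small Python sketch of that idea rather than what I did in Lumira. The function name and the sample view counts are made up for illustration.

```python
from collections import Counter


def to_24_hour(hour_12: int, meridiem: str) -> int:
    """Convert a 12-hour clock hour plus AM/PM marker to a 24-hour hour."""
    marker = meridiem.strip().upper()
    if marker not in ("AM", "PM"):
        raise ValueError(f"expected AM or PM, got {meridiem!r}")
    if not 1 <= hour_12 <= 12:
        raise ValueError(f"hour must be between 1 and 12, got {hour_12}")
    hour = hour_12 % 12  # 12 AM becomes 0, 12 PM becomes 0 + 12 below
    if marker == "PM":
        hour += 12
    return hour


# Made-up (hour, AM/PM, views) rows to show the effect of the split.
published = [
    (9, "AM", 400),
    (9, "PM", 450),
    (1, "PM", 700),
]

views_by_hour = Counter()
for hour_12, meridiem, views in published:
    views_by_hour[to_24_hour(hour_12, meridiem)] += views

print(views_by_hour.most_common(1))  # [(13, 700)]
```

With the hours combined, “9” would total 850 views in this made-up sample and look like the winner. Split into 09:00 and 21:00, the 13:00 slot comes out on top.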
Also to note there is a filter on “unknown” data in the SCN blogs charts as there is a lot of content being moved or removed from their original url after being tweeted. I was a bit concerned I had some bad connections but whether I logged in or not, I was presented with this many times.
I have my answers to my original questions so now for some further analysis of the data.
Where are the tweeps clicking on the @SCNblogs tweets
Another aspect of the BITLY api is that it allows you to query the location of the user clicking on the link. The below chart shows all the countries where users are clicking on @SCNblogs tweets over the full 3 month date range.
I’m impressed at the number of countries clicking on the @SCNblogs tweet links.
From the process of combining the SCN blogs information to the tweets I could attach an SCN user name to the @SCNblogs tweets. This allows me to do the following.
SCN Blogger with the most retweets and favourites on the @SCNBlogs timeline
For the next chart I added a calculated measure of the number of retweets and favourites on the @SCNblogs timeline from my 3 month dataset.
The above screenshot shows the creation of the new measure. After clicking the plus icon, I typed the first few characters of the existing measures and was provided a dropdown to select from. Once I added the selection I had the SCN bloggers behind the @SCNblogs retweets and favourites. Tammy Powlas is the top blogger in this category with over 100 retweets or favourites for the @SCNblogs tweets.
I added a count of clicks to find out how many times these bloggers had blogged.
Again I was impressed to find Tammy had blogged 45 times over the 3 month period I chose. The blue column indicates over 100 favourites and retweets, and the red column indicates 45 blogs for Tammy appearing on the @SCNblogs timeline! Tammy has even more blogs over the same 3 month period on the main SCN blog space!
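For anyone without Lumira to hand, the calculated measure (retweets plus favourites) and the count per blogger can be sketched in Python as below. The field names and the sample rows are hypothetical and do not come from my CSV file.

```python
from collections import defaultdict


def engagement_by_blogger(tweets):
    """Sum retweets + favourites and count tweets for each blogger."""
    totals = defaultdict(lambda: {"engagement": 0, "tweets": 0})
    for tweet in tweets:
        entry = totals[tweet["blogger"]]
        # The calculated measure: retweets and favourites added together.
        entry["engagement"] += tweet["retweets"] + tweet["favourites"]
        entry["tweets"] += 1
    return dict(totals)


# Made-up sample rows with placeholder blogger names.
tweets = [
    {"blogger": "blogger_a", "retweets": 3, "favourites": 2},
    {"blogger": "blogger_b", "retweets": 4, "favourites": 1},
    {"blogger": "blogger_a", "retweets": 1, "favourites": 0},
]

totals = engagement_by_blogger(tweets)

# Rank bloggers by the combined measure, highest first.
ranking = sorted(totals, key=lambda name: totals[name]["engagement"], reverse=True)
print(ranking)              # ['blogger_a', 'blogger_b']
print(totals["blogger_a"])  # {'engagement': 6, 'tweets': 2}
```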
SCN Blogger with the most favourited tweet on @SCNblogs
I added the individual @SCNblog tweet text and the SCN blogger name to find out which tweet had the most favourites on twitter.
The blue column shows the favourited tweets and the red column the number of retweets.
So Paul Aschmann had the most favourited tweet in my 3 month data set from @SCNblogs. Apologies to Jürgen L, as there was some corruption in the data along the way from twitter to here.
SCN Blogger with the most @SCNblogs page views
This is my data set and not any official data provided by SAP. My data is a snapshot in time from the beginning of May to the start of August, read from the @SCNblogs timeline. I do remember when SCN had a rolling top blogs chart, and I even tweeted about that to @IvanFemia when one of his blogs hit the top spot. I can’t find such a list now on SCN.
Top 10 bloggers with the most page views on @SCNblogs timeline are
3rd May – 6th Aug
Tammy Powlas again top of the charts with over 45,000 page views. *only those blogs appearing in @scnblogs timeline.
Again I added in the count to check the number of blogs (I know from earlier that Tammy will have blogged 45 times).
The red column on the y axis is a count of @scnblogs and the blue column on the x axis is the number of views for the blogger.
@SCNBlogs blogs with the most page views
Again this is my sample data for @SCNblogs data over the last 3 months (Have I stated that before 🙂 )
I thought I would add additional attributes to the chart to see if I could confirm Wednesday as the day of the week to create/publish a blog.
Since 6 of the 10 blogs appear on a Wednesday, from my dataset Wednesday is the day to publish blogs. It would most likely require a larger dataset over a longer period with all published blogs to prove this theory. However, Wednesday 13:00 is the answer to my original question for this dataset.
A place for bloggers
While collecting the data I did use the bitly api to get the full url of the base SCN blog. When I discovered that the Lumira split command allows more than one character, I split out the http://scn.sap.com from the URL and was left with the community/space of the original blog.
Therefore the place for bloggers over my 3 month dataset is the business-trends community.
The business-trends community tops the number of views too.
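The split I did in Lumira can also be sketched in Python. This assumes blog URLs follow the `/community/<space>/blog/...` pattern. The function name is made up, and the business-trends URL is a hypothetical example path.

```python
from typing import Optional
from urllib.parse import urlsplit

SCN_HOST = "scn.sap.com"


def community_of(url: str) -> Optional[str]:
    """Return the community/space segment of an SCN blog URL, or None.

    Assumes URLs shaped like http://scn.sap.com/community/<space>/blog/...
    """
    parts = urlsplit(url)
    if parts.netloc != SCN_HOST:
        return None
    segments = [segment for segment in parts.path.split("/") if segment]
    if len(segments) >= 2 and segments[0] == "community":
        return segments[1]
    return None


# Hypothetical blog path, used only to show the split.
print(community_of("http://scn.sap.com/community/business-trends/blog/2013/07/01/example-post"))
# business-trends
```

Counting the returned values per blog would then give the blogs per community, and URLs that do not match the pattern drop out as `None`.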
Thank you
A couple of mentions to blogs/information that I used from SCN.
* Nested JSON, what fun! Thank you to Dagfinn Parnas for asking this question http://scn.sap.com/thread/3180215. The question allows me to thank Peter Muessig, who provided the answer. This was the method I used to get the data for the challenge in an SAPUI5 table format.
* Thank you to Eric Farrar for posting this blog on SCN that inspired me to try all of this in the first place. While I had to slightly deviate from the blog due to my lack of knowledge, I remain very impressed with SQL Anywhere. https://scn.sap.com/community/sybase-sql-anywhere/blog/2009/12/10/calculating-hash-based-message-authentication-codes-with-sql-anywhere
Data Quality
I have analysed the data I collected with some random samples and double checks and remain satisfied with the general quality. The data is driven by the twitter api on the @SCNblogs timeline and therefore misses out on some of the blogs on the main SCN site. The @SCNblogs account uses other URL shortening services such as tinyurl & spr.ly, although only a few tweets use these. Unicode characters in the URL of the tweets threw out my data collection process, and 150 backend blogs (mainly blogs not in English) went missing out of a total of 2500. Where the URL could not be located with the bit.ly api, the click count was set to zero and the backend blogger information could not be collected. The data is a snapshot in time and may or may not contain some errors. I’m still waiting for an SCN api http://scn.sap.com/api that may help with any future SCN site data queries.
One last thing to do
Put my analysis money where my mouth is…
I have taken the SCN blog option to publish at a certain date/time. Below is a screenshot of Wednesday 13:00.
However I am hoping that the timezone is correct and that my blog will be published and not moderated! I will find out soon if my plan comes together.
Oh, and one final thing: I need to fill out the DG2 entry form to get another free t-shirt.