Analyzing Wikipedia Data with SAP Lumira
In this final post in the series, we will analyze the Wikipedia data that we originally filtered using AWS Elastic Map Reduce and Apache Hive, then loaded the transformed data into a SAP HANA database and created an Analytic View on the tables. What we will do is look at several examples of how you can use the Wikipedia data to help determine the effectiveness of marketing programs.
Legal notice: Page view statistics for Wikimedia projects provided by the Wikimedia Foundation, Inc. at http://dumps.wikimedia.org/other/pagecounts-raw/ under the Creative Commons Attribution-ShareAlike 3.0 Unported License (see http://creativecommons.org/licenses/by-sa/3.0/). The original statistics have been filtered, sorted and visualized.
First off, you can get a free 30 day trial of the Personal Edition of SAP Lumira at http://www.saphana.com/community/learn/solutions/sap-lumira. This version works with SAP HANA One on AWS. I installed my trial on the AWS EC2 VM I use for my SAP HANA Studio client.
Creating a Low Privileged User for the Lumira User
Up until this point, we’ve been using the SYSTEM user account in SAP HANA Studio to perform our tasks. As the SYSTEM user is a high privileged – like the Linux root user – you will want to setup a user account that has just the needed permissions to access the data using SAP Lumira or other tools.
To set permissions using SQL statements:
1. In the SAP HANA Studio, right click on the Catalog folder and select the SQL Console command.
2. Paste the following statements into the SQL Console and then press F8 to execute the statements.
— Create the low priviledged user for Lumira
CREATE USER LUMIRA PASSWORD Hana1234;
— Grant SELECT permissions on the SCHEMAs where the
— data resides for your Analytic Views
GRANT SELECT ON SCHEMA “WIKIDATA” TO LUMIRA;
GRANT SELECT ON SCHEMA “_SYS_BI” TO LUMIRA;
GRANT SELECT ON SCHEMA “_SYS_BIC” TO LUMIRA;
— Grant remaining rights needed to view and execute
— Analytic Packages
GRANT EXECUTE ON “SYS”.”REPOSITORY_REST” TO LUMIRA;
— Check to see that the permissions were set correctly
SELECT * FROM “SYS”.”GRANTED_PRIVILEGES” WHERE GRANTEE = ‘LUMIRA’;
You should see the permissions from SYS.GRANTED_PRIVILEDGES as shown in Figure 1.
Figure 1 Permissions for LUMIRA user
Getting started with SAP Lumira
In this section, you will open up the PAGEHITS Analytic View created as part of the blog post “Building an Analytic & Calculation View from the Wikipedia HANA tables for use with SAP Lumira” with SAP Lumira so you can start your analysis.
1. Open up the SAP Lumira program located on your desktop.
2. Click the New Document button.
Figure 2 Create a new document
3. In the New Document Select a source window, click on the SAP HANA Online option.
Figure 3 Select SAP HANA Online as the data source
4. For the SAP HANA Server name, enter in your server name – for example: hanaserver. For the SAP HANA Server instance number, enter in 00. For the User Name field, enter in LUMIRA. For the User Password, enter in Hana1234. Then, click the Connect SAP HANA Instance button.
Figure 4 Connect to SAP HANA instance
5. To preview and select the data to analyze, click on the PAGEHITS view, click the Preview and select data checkbox and click the Select button as shown in Figure 5.
Figure 5 List of available views
6. Since all data is for the year 2013, scroll down to the bottom of the Attributes list and deselect the checkbox next to the YEAR attribute. This dialog allows you to preview data by clicking on the Click here to see values samples link and add filter conditions. Try clicking on the Values Preview links. Click on a Filter link to see how it works, but don’t set a filter at this point. Click OK when you are finished experimenting with the dialog.
Figure 6 Preview and select data
Upon clicking OK, Lumira displays it’s Visualize design surface as shown in Figure 7.
Figure 7 Lumira Visualize design surface
You are now ready to start analyzing the Wikipedia data with Lumira.
Quick tour of SAP Lumira
SAP Lumira allows you to change various preferences for the software through the File | Preferences… command. To make the text easier to read, you can change the font size using the Font used for UI rendering (*) option. In Figure 8, you will see that this session is using a 12 point font.
Figure 8 Change SAP Lumira preferences
You can also change the scale factor for charts for better readability using the Chart settings tab with the Apply a factor to chart font size setting. For example, 150%. Note: You need to close Lumira and then restart it and create a New Document to see the changes. There is no need to perform these steps to continue.
1. Resize the Object Picker to make it wider so that you can see the Measure and Attribute names better.
Figure 9 Visualize view
The Visualize view allows you to drag columns into the Measures and Dimensions regions of the chart area to display data similar to the way you would to preview data in SAP HANA Studio.
2. Click Data to view the data.
Figure 10 Scenario pane
By default, Lumira creates an Occurrences measure that is used to show the number of times an attribute value occurs in the data set. Each column is called a Facet. The Facets are automatically sorted in descending order based on the measure selected to the left of the Facets. You see that the date 3/12/2013 had the most occurrences of pages that had over 100 hits for the three months of data.
3. Click in the PAGEHIT… button for the PAGECOUNTHITFORHOUR measure to see how the Facets change based on the new measure that you selected. Then resize some of the columns to see the attribute values and page hit count values easier.
Figure 11 Facets showing PAGECOUNTHITFORHOUR measure totals
Notice that the day with the most page hits is April 3 2013. Also notice that the two days of the week with the least number of page hits are Saturday and Sunday.
4. To see both the data and work with charts, click on the Split button. You can create a quick bar chart by double clicking on the PAGEHITCOUNTFORHOUR in the Measures Object Picker to add it to the Y Axis 1 and then double click on the DAY_OF_WEEK in the Attributes Object Picker to add it to the X Axis.
Figure 12 Quick chart
5. To see the actual English day of the week, first click on the Visualize button next to the Split button to display just the chart data. Then, double click on the ENGLISH_DAY_OF_WEEK attribute. You will see that 0 corresponds to Monday.
Figure 13 Adding attributes to x-axis
6. To see how page hits by day of the week are impacted by different languages, you can use a filter on the Language facet. First, scroll to the Language Facet and then resize it so you can see the Language names and page hit values. Then click the Filter icon to open the Filter items dialog.
Figure 14 Filter on facet
7. To set a filter on multiple languages, select the first four languages in the list and then click the Add button to add them to the Filtered values list. Then click the Close button to set the filter condition.
Figure 15 Select a filter on top 4 occurring languages
To see that the filter was set, Lumira displays the facet filter in the Global Filters region above the facets. You will also see that the chart is updated based on the set filter.
8. To see how each language interacts overall with Wikipedia data, first click on the Line chart type just above the chart region. Then, drag the Language attribute in the Attribute Object Picker and drop it into the Legend Color Dimension.
Figure 16 Specifying the measure columns
You can see that German language picks up its activity on Sunday. You can also see that peak times with Japanese are Friday, Saturday and Sunday. One thing to keep in mind is that all times are based on the UTC time zone. We also can’t assume that a language is tied to a specific geographical location, especially English.
9. To clear the global filter, click on the little X icon next to the filter in the Global Filter area.
Figure 17 Clear the global filter
Note that all of the languages are now shown in the chart, many of which all overlap at the bottom of the chart. Click on the close icon next to the Language attribute in the Legend Color Dimension to remove the Language series from the chart.
Analyzing Wikipedia Page Hits
In this section, you will see how events related to a topic on Wikipedia influence the page hits for the topic. In this example, you will look at the page hits for the SAP HANA page.
At this point, you should have the PAGEHITCOUNTFORHOUR measure setup in the Y Axis 1 position as a bar chart.
1. Double click on the DATE_SQL attribute to add it to the X Axis to display the hits per day for the three month period.
2. Scroll the facet region so that you can see the PAGENAME attribute, click on the filter button for the facet and then type SAP_HANA into the search box.
3. Select the value and click the Add button to set the filter. You will see that the chart is updated to show the page hit totals for each day related to the SAP_HANA page. Then, click the Close button to dismiss the filter dialog.
Figure 18 Set filter on the SAP_HANA page
4. You will see that there are three major spikes in page hits for the SAP_HANA page for May 6, May 16 and May 29. The spikes on May 16 and 17 are easy to explain due to the number of announcements that were made at the SAP Sapphire conference May 14-16. To understand what happened on May 6, click on bar in the chart and then press the Filter button in the chart region.
Figure 19 Filter on a select bar
The result is a local filter in the chart that does not change the global filter and the facet values.
5. To see how hits looked hour-by-hour for the day, double click on the HOUR attribute in the Attributes Object Picker. You will see that all the page hits occurred in the 23pm time period UMT or 7pm EST May 6.
6. If you are someone at SAP or a SAP HANA watcher, you might know what happened. A search of press releases around that time period revealed the announcement of the SAP HANA Enterprise Cloud on May 7. Click on the link – http://www.sap.com/corporate-en/news.epx?PressID=20843 to view the press release. Note that the date for the press release was May 7. Before press releases like this, there is often a pre-announcement social media buzz or inadvertent media leak that takes place. To see what might have caused the buzz and the subsequent hits on the SAP_HANA Wikipedia page, click on the link http://www.saphana.com/docs/DOC-3554. There you will find an article titled SAP HANA Enterprise Cloud FAQ dated May 6, 2013 7:19PM. Anyone watching SAP would have an RSS feed on the documents coming from saphana.com, so it is possible that this article caused the buzz.
7. To clear the chart local filter, there is a chart Filter area in the lower left portion of the chart UI where you will see the DATE_SQL attribute. Click on the X next to the DATE_SQL attribute to remove the filter.
Figure 20 Creating a calculation view
Once you clear the filter, you will see a huge spike for the 23 hour on May 6 that is almost 10 times more than any other hour period for the three month period. For the marketing people at SAP, they could correlate this burst with any Facebook, Twitter and saphana.com activity to see if there was a similar spike at that same time.
8. To review the distribution of page hits per hour for the May 29 period, double click on the 5/29/2013 date in the DATE_SQL facet. You will see a fairly even distribution of page hits between the hours of 0700 and 1400 UMT. Going back through the SAP press releases for May 29, one stands out – http://www.sap.com/corporate-en/news.epx?PressID=21025. This press release is titled “SAP Named Global Market Leader in Business Intelligence by Gartner”. Again, marketing people at SAP could cross correlate this to their web properties to see how this press release may have caused a “buzz” on Facebook and Twitter. It appears that product announcement “leaks” can cause a larger bump in page hits in a short period of time over a positive analyst report. Again, this could be cross correlated with page hit data from other sources to determine if Wikipedia can in fact be a bell-weather baseline for social media events.
Using SAP Lumira to determine trending
In this section, you will learn how easy it is to create trend lines against the page hit data over time.
1. Click on the X next to the HOUR Dimension in the X Axis region to remove the HOUR from the chart. Then, click on the X next to the DATE_SQL in the Global Filters region to show all the data for the three month period. Then, click on the Line chart option to change from a bar to a line chart.
2. There is a small down arrow button next to the PAGEHITCOUNTFORHOUR measure in the Measures Y Axis 1 region that displays a drop down menu. Select the Add calculation > Running Average command.
Figure 21 Adding a running average trend line
This action displays the running average trend for the page hits over the period.
Figure 22 Calculation view design area
You can see that SAP Lumira created a new calculated measure in the Y Axis region that computes the running average on the fly.
Just to wrap up, you should have a good appreciation of how to do the following:
- Create a low privileged user for accessing the data using SAP Lumira.
- Create a new SAP Lumira document that connects to an SAP HANA Analytic View.
- Change Lumira preferences for better readability.
- Use the various regions to filter data at a global and chart level.
- Create bar and line charts with Lumira based on different attributes across a couple of dimensions.
- Create trend lines using a running average based on a measure.
I hope you have enjoyed this blog series. Be sure to check out my upcoming webinar where I will present the entire series on BrightTalk Sept 24 at 11am PST at https://www.brighttalk.com/community/cloud-computing/webcast/8927/84151