Welcome to part 5 in this series of demonstrating how to analyze Wikipedia page hit data using Hadoop with SAP HANA One and Hive. In part four “Enhancing the Wikipedia data model using fact and dimension tables with SAP HANA”, I will use SAP HANA Studio to create an Analytic & Calculation view based on the PAGEHITSFACTTABLE fact table and the three dimension tables: DIMPROJECT, DIMLANGUAGE and _SYS_BI.M_TIME_DIMENSION. I will also cover the initial step of granting SELECT permissions to the _SYS_REPO user for the SCHEMA that that holds the fact and dimension tables so that you can actually save your Analytic View. I will be covering the high-level components shown in the diagram below.
Let’s get started!
Launch SAP HANA Studio. If not connected to database, click on the Choose Connection button on top right. Select your HANA database instance and click OK.
Make sure that the _SYS_REPO user has SELECT permissions to the WIKIDATA schema
For first time users of the data modeler in SAP HANA Studio, you must grant SELECT permissions to the internal _SYS_REPO user so that the data modeler can access the tables in your schema. If you forget to do this step, the data modeler cannot save the Analytic View.
In the SAP HANA Systems Pane on the left, expand by clicking on the arrows next to the following in order: HDB (SYSTEM) -> Security -> Users. Then, double click on ‘_SYS_REPO’ under Users. This displays the permissions page of the _SYS_REPO user as shown below.
Switch to tab SQL Privileges as shown below and click on the Plus button just above the list of SQL Objects to add the new permission.
Under Type name to find a catalog object, enter ‘WIKIDATA’. Select the WIKIDATA schema and click OK.
Under SQL object, select WIKIDATA. In the Privileges for ‘WIKIDATA’ window on right, check SELECT and select Yes for Grantable to Others.
Close the document *HDB-_SYS_REPO by clicking on the X in the tab and click Yes when prompted to save changes.
Data modeling using the Modeler perspective
In order to access the tools to create an Analytic Package and Analytic View within the package, you need to open up the Modeler perspective.
In the top menu, click on Window, go to Open Perspective and click on Modeler. If you do not see Modeler, select Other and click on Modeler followed by OK.
HANA Studio should display the Quick Launch page. If you do not see it, go to the Help menu and select the Quick Launch command.
Create an Analytic Package
The first step to exposing data to SAP Lumira is to create an Analytic Package. The Quick Launch page makes it obvious by displaying the first step in the middle of the Quick Launch page. Just click on the Create… button to get started.
In the Quick Launch tab, Welcome to Modeler window, select Package under New if it is not already selected and click on the Create… button.
I will call this package WIKICOUNTS and add a description as shown below. Click OK, when you are ready to continue.
NOTE: I am taking a short cut here by leaving the Person Responsible value as SYSTEM. In real life, you should use a lower privileged user account.
Create an Analytic view with the fact table and the three dimension tables
SAP Lumira uses the Analytic View as a first class data source, so it is time to create one. In the Quick Launch tab, select Analytic View under New and click on the Create… button.
In the New Information View pop-up, enter the name for information view and description as shown below.
We now have to associate the container Package with The Analytic View. To do that, click on Browse… next to the package field, select the WIKICOUNTS package and click OK.
Click on Finish to complete the naming of the view and continue on to the designer.
I now need to add the four tables that will be part of the Analytic View. To do that, I need to use the Data Foundation block in the Scenario pane. Just click on the Data Foundation in the Scenario pane and the click on the + sign that appears next to it to display the Find dialog.
I now need to find each table and select in for the data mode. Just type in the name of the table, in this case I start with PAGE to display all the objects that start with PAGE and then select the PAGEHITSFACTTABLE and click OK to add it.
I then followed the same process for the _SYS_BI.M_TIME_DIMENSION, WIKIDATA.DIMPROJECT and WIKIDATA.DIMLANGUAGE tables. Here is what the screen looks like after adding the tables.
Place the fact table – PAGEHITSFACTTABLE in the middle of the schema by selecting it and dragging it. This will make it easier for creating relationships.
Creating relationships within the Analytic View
I now have to create the relationships for the Analytic view. The idea is to join the lookup column in the dimension table to the corresponding column in the fact table. To setup the relationship between the M_TIME_DIMENSION and PAGEHITSFACTTABLE table, I first select the DATE_SAP column in the M_TIME_DIMENSION table and then drag a relationship line to the EVENTDATE column in the PAGEHITSFACTTABLE table. The order you do this operation in is important so that you get the correct one to many (1..n) mapping. That is, drag the column from the dimension table to the lookup column in the fact table.
Notice how the Properties dialog shows the properties for the relationship. To create the other two relationships, do the following:
- Drag the LanguageCode column from the DIMLANGUAGE table to the LANGCODE column in the PAGEHITSFACTTABLE table.
- Drag the ProjectCode column from the DIMPROJECT table to the PROJCODE column in the PAGEHITSFACTTABLE table.
I rearranged the tables to come up with something that makes the diagram look like a “star schema” as shown above.
The next step is to add the columns from the fact table and dimension tables as output columns. I am going to add them in a specific order to note the logical hierarchy of the data as follows:
- Right click on ProjectTitle in the DIMPROJECT table and select the Add to Output command. I do not need to add PROJCODE from the PAGEHITSFACTTABLE table because the join will correctly hook up the correct value for the ProjectTitle.
- Right click on Language in the DIMLANGUAGE table and select the Add to Output command.
- Click on PAGENAME in PAGEHITSFACTTABLE and press Shift+Click on Hour to multi-select PAGENAME, YEAR, MONTH, DAY and HOUR columns. Then, right click on the selected range and choose the Add to Output command.
- Select both the ENGLISH_DAY_OF_WEEK and ENGLISH_MONTH columns in the M_TIME_DIMENSION table. Then, right click on the selected range and choose the Add to Output command.
- Finally, select both the PAGEHOTCOUNTFORHOUR and BYTESPERHOUR columns in the PAGEHITSFACTTABLE table. Then, right click on the selected range and choose the Add to Output command.
Your diagram and list of output columns should look like the screen shot below.
Next, I have to specify the actual measure columns as follows:
- Click on the Semantics object in the Scenario pane to display the Details for the view.
- Scroll down the list of Local Columns, select the PAGEHITCOUNTFORHOUR column.
- Press the Mark for Measure command just above the Label Column as shown in the screen below.
HANA Studio automatically determines the rest of the attributes and measures during the Save and Validate step.
The last two steps are to perform the Save and Validate command followed by the Save and Activate command. Just click on each of these commands shown in the tool bar below to deploy the view.
As you can see in the Job Log above, the model was validated and activated as expected. The Model Validation displays a Status of Completed with warnings. To see what those warnings are, just double click on the message to view the Job Details.
Since I am not doing anything with ABAP or the BI Virtual InfoProvider, I can ignore the warnings. Click OK to close the dialog.
I can now do a quick test to view the data by clicking on the Data Preview command as shown below.
Next, I will drag the ProjectTitle column into the Labels axis and the PAGEHITCOUNTFORHOUR column into the Values axis to display a validation of the data lookups as shown below.
Creating calculated columns using a Calculation View
I want to make two more enhancements to the data model. First, I want to create a calculated measure that computes the BYTESPERHOUR divided by PAGEHITCOUNTFORHOUR to see how pages can grow over time for current event topics. Second, I want to add a hierarchy for the page hit date so that I can see data by year, month, day and hour easily within SAP Lumira.
To do these changes, I need to create a Calculation view that takes the PAGEHITS Analytical View as an input and then adds the enhancements.
To get started, navigate to the WIKICOUNTS package and then right-click and choose the New > Calculation View… as shown below.
In the New Calculation View, I will enter in the name of the view PAGEHITCALCVIEW, provide a meaningful description and then click Finish.
You should now see the Calculation View design surface that looks like the screen below.
Next, I will
- Select the PAGEHITS Analytic View in the object explorer
- Drag the PAGEHITS Analytic View under the Output object in the left pane for the designer.
- Drag a connection line from the PAGEHITS view to the Output object.
- Click on the Output object to start enhancing the model.
Creating a Calculated Measure
To create the Calculated Measure to compute the average size of the page for a given hour, I need to add the PAGEHOTCOUNTFORHOUR and BYTESPERHOUR columns as measures. I will right click on each of the columns and select the Add as Measure command. You should see the two columns under the measures folder. This makes the measures available for the next step.
I will now right click on the Calculated Measures folder in the right hand output pane and select the New… command to display the Calculated Measures dialog. I will give the measure a name and description as shown below.
To build the expression in the Expression Editor, I did the following:
- Expanded the Measures under the Elements list
- Double clicked on the BYTESPERHOUR measure
- Double clicked on the divide by symbol “/”
- Double clicked on the PAGEHITCOUNTFORHOUR measure
- Clicked on Validate to verify the expression and click OK to clear the validation message box
- Clicked OK to complete the measure definition.
Creating a hierarchy based on date attributes
I could add the year, month, day and hour attributes as individual attributes, however tools like SAP Lumira can additional filtering and shaping of results when using a date based hierarchy. Building a date-based hierarchy is simple. To do so, I Shift select the YEAR, MONTH, DAY and HOUR attributes and the right click on them and choose the New Hierarchy > New Level Hierarchy command as shown below.
I can then fill in the name and description fields as shown below to complete the operation. This is because the order of the fields do not need reorganization.
Clicking OK completes the operation. Notice that when I added the hierarchy, HANA Studio added the individual date attributes under the Attributes folder.
Adding the rest of the attributes to the output
The only thing left to do now is to add the rest of the attributes to the output. To do this, I just need to Shift select the remaining columns, right click and select the Add as Attribute command.
That is it! I can now run the Save and Validate command followed by the Save and Activate command to complete the creation of the Calculation View.
At this point, I cannot wait to jump into SAP Lumira to see that mysteries I can uncover from the Wikipedia data. In the next blog, I will show you how to use SAP Lumira to analyze data using the Calculation View I just created.
Just to wrap up, you should have a good appreciation of how to do the following:
- Grant rights to the _SYS_REPO user to the schemas used for the views
- Using the Modeler perspective and the Quick Launch page for performing actions
- Creating an Analytic Package
- Creating relationships for a star schema
- Using a Calculation View for enhancing the model to add calculated measures and hierarchies
I hope you have enjoyed the blog series so far!