Harnessing Multi-Model Capabilities with Spotify –...

Vivek-RR · ‎12-28-2023

In blog 6 of our Spotify series, we shift our focus to visualization, explicitly pertaining to Playlist and Audio features API, as discussed in this series's second and third blogs. We will delve into the detailed consumption of Calculation views from SAP HANA Cloud, utilizing tools such as SAP Analytics Cloud and Microsoft Power BI. For SAP Datasphere models, based on the third blog, the consumption process is quite similar, and we will highlight the differences where they exist.

In this blog, we will cover the following topics in detail:

Consuming Calculation Views built using the Business Application Studio in SAP Analytics Cloud.

Utilizing R script to construct an interactive density plot to comprehend the Danceability metrics across all playlists.

Utilizing R script to create an interactive visualization of the Speechiness metric and compare it across all Playlists.

Leveraging a robust data visualization package in R, known as "Grammar of Graphics Plot" (GGPLOT), to visualize the Speechiness and Danceability metrics.

Consuming the same Calculation views in Microsoft Power BI.

This blog post is part of a comprehensive series. If you're interested in exploring more, feel free to visit the other blogs in this series:

Architecture

Processing Semi-Structured data in SAP HANA Cloud

Processing Semi-structured data in SAP Datasphere

Processing Semi-Structured data in SAP HANA Cloud & creating graph networks

Building Intelligent Data Apps based on Spotify Datasets

The Visualization

Consuming Calculation Views built using the Business Application Studio in SAP Analytics Cloud

This section will concentrate on accessing the deployed calculation view in SAP Analytics Cloud (SAC). Suppose you have followed along with blog 2 and cloned and deployed the associated Git repository, which now includes content for both reporting and building Graph networks. In that case, you should have access to the calculation view, CV_TOP_ALL[1].

To access the Calculation view from either SAC or Microsoft PowerBI, it's crucial to ensure that the hdbrole[2] is assigned to the database user created for frontend consumption. This step is vital for enabling the necessary permissions and access rights for the user to interact with the data through these platforms.

Remember, the correct assignment of roles and permissions is a crucial step in maintaining the security and integrity of your data while still allowing for flexible and robust data analysis capabilities.

This calculation view is based on the SQL view TOPALL1 discussed in blog2.

The semantics details for the calculation view, CV_TOP_ALL[1], include a calculated column named "Speechiness1".

As highlighted previously, the audio features from the Spotify API include a "speechiness" attribute. Speechiness measures the presence of spoken words in a track. Songs with exclusively instrumental music and no vocals have low speechiness, while rap songs and podcasts with continuous speaking have higher scores.

This metric helps compare speech patterns across playlists and geo-markets. Analyzing speechiness allows identifying playlists that contain:

Mostly music-centric tracks

Increased spoken vocal content

Distinct formats like rap, hip hop, podcasts

In summary, the speechiness measure detects tracks with more spoken emphasis over instrumentation. We can leverage this to spot regional playlist preferences for words vs music.

Source: Spotify API Documentation

I establish a threshold to differentiate speech-centric playlists by subtracting 0.33 from the Spotify speechiness score. Playlists with tracks containing predominantly spoken words (e.g., rap, hip hop, podcasts) typically have higher speechiness exceeding this threshold.

The interactive visualizations in SAP Analytics Cloud will spotlight playlists based on this speechiness threshold:

Playlists above the threshold contain more vocal-heavy tracks

Playlists below the mark have mostly musical instrumentation

With the calculation view deployed, the next step is allowing consumption by assigning database permissions.

Specifically:

Use the SAP HANA Cloud Database Explorer to connect to the HDI container holding the deployed view. This verifies successful deployment.

Then in SAP HANA Cloud Cockpit, locate the "json::TRACKSALL" hdbrole that got created.

Assign this hdbrole to the database user needing access to the view.

Enabling this role assignment via Cockpit is a prerequisite before the calculation view can be leveraged for reporting and analytics. The Explorer allows consumption confirmation post assignment.

On successfully creating a new database user, provisioning the roles, and connecting to the HDI container, you can effectively generate a scatter plot[4] based on the calculation view, CV_TOP_ALL[1].

The scatter plot is a powerful visualization tool that allows you to see the relationship between two variables. In this case, we have chosen to visualize the correlation between the metrics 'Danceability' and 'Popularity'[3] across tracks of playlists from four different countries. This is achieved by setting a filter[2] on 'Tracklistname'.

To interpret the scatter plot, each point on the plot represents a track. The position of a point on the horizontal axis indicates its 'Danceability' score, and its position on the vertical axis indicates its 'Popularity' score. If there is a pattern in the points, such as a line or curve, this suggests a correlation between 'Danceability' and 'Popularity'.

Remember, correlation does not imply causation. While the scatter plot may show a relationship between 'Danceability' and 'Popularity', it does not prove that increasing 'Danceability' will increase 'Popularity'. Other factors may be influencing both variables.

Consuming Calculation Views built using Business Application Studio in SAP Analytics Cloud[SAC]

If you already have access to the SAP Analytics Cloud (SAC) tenant, you can proceed with the next steps. However, if you don't, you have the option to register for a SAC trial, which lasts up to 60 days. This trial period allows you to explore and familiarize yourself with the functionalities of SAC.

For further information about the SAC trial, including registration process, available features, and any limitations, please refer to the provided link. This link should direct you to a FAQ page and additional details about the SAC trial.

Setting up connections to SAP HANA Cloud

Assumptions:

You already have access to the SAC tenant.

To create a connection to SAP HANA Cloud from your active SAC tenant, follow these steps:

From the main menu, select 'Connection'. This will take you to the page where you can manage all your connections.

Click on the '+ Add' button to start creating a new connection.

From the list of available connection types, select 'SAP HANA Cloud'.

Enter Connection Details: In the form that appears, you'll need to provide the necessary details such as host, port, and logon credentials. Click OK

Now, your SAC tenant is connected to your SAP HANA Cloud instance, and you can start creating models and stories using your SAP HANA Cloud data.

Let’s build live Data Models

Select the Modeler from the drop-down and build a Live data connection. Use the connection that was created in the previous step and select the calculation view CV_TOP_ALL.

Save the model once you validate all the measures and Dimensions. In my case, the analytical model is “Tracksall”

Let’s create a Story

Now that we have built an analytical model, we can leverage it to gain insights into playlists and songs by visualizing key metrics. Specifically, we will create data stories using ggplot to examine three aspects:

Danceability, Speechiness, and Energy Across a Playlist: We will visualize the danceability, speechiness, and energy scores for each song in a chosen playlist by plotting the metrics with a trending chart. This allows the assessment of metric distributions within that playlist.

Top Playlists for Danceability: Using R scripts and ggplot, we can analyze danceability scores across all playlists for a given week. By plotting a sorted chart showing the playlist with the highest average danceability at the top, we can readily identify which playlist contains the most danceable tracks.

Comparing Speechiness Thresholds Across Playlists: We set a speechiness threshold based on song composition to categorize tracks as having either high or low importance on words versus music. Then with R, we plot high and low speechiness scores for each playlist using ggplot. This enables easy comparison of which playlists favor tracks with more spoken words.

Danceability, Speechiness, and Energy Across a Playlist

This is a standard trending chart[1] based on the SAC Analytic model TRACKSALL, and basically compares the 3 metrics across the playlist TOPUSA[2]. TRACKLISTNAME groups all 50 tracks from the USA and I have used the filter[2] for the same.

When you analyze the top 50 songs from the USA tracks, you will notice the songs with high energy or danceability will have less "speechiness".

And songs like Rich Flex by Drake has higher "speechiness" value because of all the rapping yo! 😊 It’s a mix of R&B and RAP and you notice the high Speechiness factor.

Top Playlists for Danceability[Using R]

Prerequisites:

1. To utilize R-scripts, connect to either a remote R environment (BYOR) or an R environment provided by SAP in various data centers. Refer to this link for availability details. Follow these steps to set up a remote R server, similar to integrating R with SAP HANA.

2. For production use cases, verify required R packages. If packages needed for your use case are unsupported by SAP, set up an R environment with those packages accordingly.

3. Review the basics of ggplot2, an R library providing flexible, tidy, optimized data visualization. It enables integrated data exploration and analysis. Refer to the ggplot2 documentation.

For this Spotify example, I leveraged an SAP-provided R environment in the EU10 data center. First I will share the visualization, followed by the script, explanation, and SAC steps. This output utilizes interactive data visualization to compare "danceability" metrics across music playlists.

And here is the script for the R visualization.

library(plotly)

library(ggplot2)

salmon <- "#F8766D"

teal <- "#00BFC4"

orange <- "#D95E0E"

limegreen <- "#7CAE00"

tangerine <- "#FF9E13"

skyblue <- "#56B4E9"

junglegreen <- "#009E73"

mustard <- "#F0E442"

sapphire<- "#0072B2"

goldenrod <- "#E69F00"

viz4 <- ggplot(Tracksall, aes(x=Tracksall$DANCEABILITY, fill=Tracksall$TRACKLISTNAME,

                    text = paste(Tracksall$TRACKLISTNAME)))+

  geom_density(alpha=0.7, color=NA)+

  scale_fill_manual(values=c(salmon, teal, orange, limegreen, tangerine,skyblue,junglegreen, mustard,sapphire,goldenrod))+

  labs(x="Danceability", y="Density") +

  guides(fill=guide_legend(title="Playlist"))+

  theme_minimal()+

  ggtitle("Distribution of Danceability Data")



ggplotly(viz4, tooltip=c("text"))

How would I explain this to a Business User?

What this graph specifically shows is the danceability distribution across various playlists based on the dataset TRACKSALL. Using density plots, we can visualize how concentrated certain playlists are in high or low danceability scores. For example, some playlists have most of their tracks clustered on the higher end of danceability, meaning the songs tend to be quite danceable(e.g. Chile Playlist). Other playlists have a wider spread across the axis. Just by glancing at the colors and density shapes, we can get a sense of the variation in dance "suitability" across these playlists.

How can I understand the script from a technical perspective?

Sure, let's break down the script.

Lines 1-2: These lines load the plotly and ggplot2 packages. plotly is used for creating interactive plots, and ggplot2 is a plotting system for R.

Lines 3-12: Define a series of colors using their hexadecimal codes. These colors will be used in the plot.

Line 13-14: This line starts the creation of a ggplot object. Tracksall is your data frame, DANCEABILITY is the variable for the x-axis, and TRACKLISTNAME is the variable used to fill the density plot. The text aesthetic is used to specify the text that will be displayed when you hover over the plot in the final interactive plot.

Line 15: This adds a density plot to the ggplot object. The alpha parameter controls the transparency of the plot (1 is opaque and 0 is transparent). color=NA means that the outline color of the density plot is set to be transparent.

Line 16: This line sets the fill color of the density plot based on the TRACKLISTNAME variable. The colors are manually specified using the color variables defined earlier.

Line 17: This line sets the labels for the x and y axes.

Line 18: This line adds a legend for the fill colors with the title "Playlist".

Line 19: This sets the theme of the plot to be minimal, which is a theme with a clean and minimal aesthetic.

Line 20: This line sets the title of the plot.

Line 22: This line converts the ggplot object to a plotly object, which makes the plot interactive. The tooltip parameter specifies that the text specified in the text aesthetic should be displayed when you hover over the plot.

How do I implement this in SAC?

Please make sure R server is enabled on your SAC tenant or your remote R connection is established.

As part of your story, add R visualization as shown below:

Once you add the R visualization, provide the input data source[1] as your Analytics model “Tracksall”. Select the “Edit script”[2] option and copy the script that I shared. If your Analytics model name is different, please make sure you replace it in the script on line 13.

With just a few lines of code, the flexibility and ease of ggplot2 allows us to quickly visualize the data into an insightful plot, demonstrating the power of this graphics package.

Comparing Speechiness Thresholds Across Playlists - Using R

I will share the visualization, R script, and an explanation of the script. However, you can follow the same implementation steps in SAC as we previously discussed for the "Distribution of Danceability Data" example. Let's start with the visualization first: This script creates an interactive bar chart that visually represents the "Speechiness" of different tracks across various playlists. Speechiness measures the presence of spoken words in a track - the higher the value, the more words in the track. Each bar represents a track, with the bar height indicating the speechiness. The bars are color-coded by playlist, enabling easy distinction between playlists. Some tracks in different playlists have Speechiness > 0, potentially indicating rap songs, podcasts, or audiobooks.

And here is the script for the R visualization.

library(plotly)

library(dplyr)

library(ggplot2)

salmon <- "#F8766D"

teal <- "#00BFC4"

orange <- "#D95E0E"

limegreen <- "#7CAE00"

tangerine <- "#FF9E13"

skyblue <- "#56B4E9"

junglegreen <- "#009E73"

mustard <- "#F0E442"

sapphire<- "#0072B2"

goldenrod <- "#E69F00"

viz1 <- ggplot(Tracksall, aes(x=reorder(Tracksall$TRACK_NAME, -Tracksall$SPEECHINESS), y=Tracksall$SPEECHINESS, fill=Tracksall$TRACKLISTNAME, text=(paste("Track:", Tracksall$TRACK_NAME, "<br>",

"Artist:", Tracksall$Artist_NAME, "<br>",

"Speechiness:", Tracksall$SPEECHINESS))))+

geom_col()+

scale_fill_manual(values=c(salmon, teal, orange, limegreen, tangerine,skyblue,junglegreen, mustard,sapphire,goldenrod))+

theme_minimal()+

theme(axis.title.x=element_blank(),

        axis.text.x=element_blank(),

        axis.ticks.x=element_blank(),

        axis.ticks.y=element_blank(),

        panel.grid.major = element_blank(),

        legend.position="none")+

ylab("Speechiness Difference")+

  facet_wrap(~ Tracksall$TRACKLISTNAME)+

ggtitle("Speechiness Difference")

ggplotly(viz1, tooltip=c("text"))

How would I explain this to a Business User?

This interactive chart allows us to compare the "speechiness" of tracks across different playlists. Speechiness measures the presence of spoken words in a song. Songs with exclusive music and no words have low speechiness. Podcast and audiobook tracks would be highly speechy. The height of the bars represents the speechiness value - higher bars mean more spoken content. The color shading groups tracks by the playlist they belong to.

How can I understand the script from a technical perspective?

Lines 1 -3: Load the necessary libraries-plotly for creating interactive plots, dplyr for data manipulation, and ggplot2 for creating static plots.

Lines 4-13: Defines the series of colors using their hexadecimal codes. These colors will be used in the plot.

Line 14-17: Creates a ggplot object viz1 with the following features: The aes function is used to map the reordered 'Track Name' to the x-axis, 'Speechiness' to the y-axis, 'Playlist' to the fill aesthetic of the plot, and a combination of 'Track', 'Artist', and 'Speechiness' to the hover text.

Line 18: geom_col used to create a bar plot

Line 19: scale_fill_manual is used to manually assign the predefined colors to the different 'Playlist' categories.

Line 20: theme_minimal is used to apply a minimalistic theme to the plot.

Line 21-26: theme is used to remove the x-axis title, x-axis text, x and y-axis ticks, major grid lines, and the legend.

Line 27: ylab is used to label the y-axis as 'Speechiness Difference'.

Line 28: facet_wrap is used to create a separate plot for each 'Playlist'.

Line 29: ggtitle is used to set the title of the plot as 'Speechiness Difference'.

Line 30: Finally, ggplotly is used to convert the ggplot object into a plotly object, making the plot interactive. The tooltip argument is set to 'text', which means that the hover text will display the 'Track', 'Artist', and 'Speechiness'.

And the SAC Implementation?

Follow the same steps for SAC implementation as before, and you should be able to view both the SAC visualizations based on R-script.

How does this work with SAP Datasphere?

In Blog 3 of the Spotify Series, we detailed how to construct analytical models akin to those in SAP HANA Cloud.

When using SAP HANA Cloud, we connected directly to the database with a user who had access to the deployed containers. However, with SAP Datasphere, analytic models built within a space can be directly exposed when constructing trend charts in SAC. There is no need to separately establish a database connection. Instead, the data models are available for visualization as soon as they are deployed to the space. This enables simpler and faster data access when leveraging SAP Datasphere as the backend data source within SAC. The tight integration facilitates the rapid connection of visualizations to data models that have been developed and exposed through spaces.

You can build stories in SAC connected to SAP Datasphere in the same way as described in previous examples linking to SAP HANA Cloud. The process of building visualizations, filters, stories, and explanatory text follows the same methodology whether your data models reside in HANA Cloud or Datasphere.

OK, how to expose the models to Microsoft PowerBI?

Prerequisites:

You have access to the Microsoft Power BI Desktop App.

You have two options for connecting to data for visualization: HDI containers from SAP HANA Cloud or analytic models created in SAP Datasphere using open SQL schema access. If planning to connect from SAP Datasphere, refer to the "Database User Creation [Open SQL Schema]" section from the third blog post.

For this example, I will connect SAP HANA Cloud to Power BI. Once you are in the Power BI Desktop App, click on the "Home" tab in the ribbon, then click on "Get Data".

You can choose either "Import" or "DirectQuery" to connect to SAP HANA Cloud. Import means you'll be pulling the data into Power BI, while DirectQuery means you'll be working directly with the data on the server.

Enter your SAP HANA Cloud credentials when prompted in the Power BI connection window. After a successful connection, the Power BI Navigator will appear allowing you to select the specific tables, views, or calculation views to load. For this analysis, we will be working with the CV_TOP_ALL calculation view that was previously created in SAP Business Application Studio.

It may take a few seconds for the CV_TOP_ALL calculation view to load in Power BI. Once loaded, you will see CV_TOP_ALL listed on the right side of the screen along with options to build visualizations. To demonstrate some useful features in Power BI, I will create a simple table visualization[2] based on the data from CV_TOP_ALL.

Once you select the table option, a blank table visualization will load. You can then select the specific columns from the CV_TOP_ALL calculation view that you want to display in the table. For this example, I chose to include the Album name, Track name, and Image columns. The Image column contains public URLs pointing to album artwork associated with each track.

Select the Image column[1] from CV_TOP_ALL, and change the data category[2] to "Image URL".

Once you add those columns to the table, Power BI works its magic to transform the image URLs into actual album cover art on the fly. How awesome is that! With its slick auto-image rendering wizardry, Power BI saves us muggles from having to manually extract and embed images in visualizations. We just provide the URLs, and presto - album covers appear in the table as if by divination! 🙂

If you have been following this Spotify blog series, you may recall that we used the HANA_ML library to extract image URLs from Spotify and ingest them to SAP HANA Cloud. We loaded the JSON metadata containing these image links into the SAP HANA Cloud Document Store. SQL views were then created to select the image URLs from the JSON artifacts stored in the document store. These SQL views were incorporated into the CV_TOP_ALL calculation view, which combines datasets from various sources. This calculation view is later accessed in both SAP Analytics Cloud for visualization and Power BI to demonstrate auto-image rendering as we have explored.

In this Power BI report, I utilized the same CV_TOP_ALL calculation view that was created in SAP Analytics Cloud and generated two visualizations:

A table showing Album name, Track name, and Image columns. This allows viewing album art images automatically rendered from the image URL data in CV_TOP_ALL.

A ribbon chart comparing the Danceability, Energy, and Liveness metrics for each track. This is similar to the analysis done in SAP Analytics Cloud. In this case, the Track name is the lone attribute being compared across those three metrics in the ribbon chart.

Power BI's integration with custom visual apps allows for a more flexible analysis of the CV_TOP_ALL calculation view, including visualizing based on images. You can access these additional visualizations in Power BI by selecting "Get more visuals" from the Visualizations pane.

Selecting "Get more visuals" will open the Power BI visuals gallery. In the gallery, search for the "image grid" custom visual. When you find the image grid visual, add it to your report by clicking the "Add" button.

Once added, you will see the app as part of your Power BI desktop.

When you select the image grid visual and choose the Image column from the CV_TOP_ALL calculation view, Power BI will automatically populate an image grid displaying all 500 album cover images that were extracted from the Spotify data. By default, the images are rendered in the grid sorted by the predefined order in the view. A key capability offered by the image grid is the ability to visualize a collection of images and rapidly sort them by different attributes to spot visual patterns or trends. For example, with a few clicks, you can rearrange the grid sorted alphabetically by artist, genre, release date, etc. Looking at the images sorted in different ways allows you to analyze the data in new visual perspectives that may yield additional insights. Pretty cool, eh?

Additional filters can be applied to the image grid visual to narrow down the list of tracks being analyzed. For example, the grid can be filtered to only show images from the "Top Indian tracks" playlist that was ingested from Spotify. Furthermore, the Speechiness audio feature metric can be added as a filtering criteria to only display tracks above or below a certain speechiness threshold. Applying these types of filters allows slicing the 500-track image grid down to a subset of images matching the given criteria. This enables more focused visual analysis. For instance, in this case, filtering to Indian tracks with high speechiness shows the actual album images associated with that segmented list of verbal tracks.

I hope this Spotify series has sparked interest and ideas for adopting SAP HANA Cloud and integrating it into your own analytics use cases. My sincere thanks to the colleagues and community members who reached out to share the visualization blog—your feedback motivated me to complete it before the end of the year 🙂

Looking ahead to 2024, we plan to continue creating enablement blogs focusing on SAP HANA Cloud- either expanding on this Spotify series or developing a new one focusing on both multi-model analysis extending it to the new SAP HANA Cloud Vector Engine as well as the SAP's Generative AI Hub platform. There are so many emerging capabilities to explore!

Please stay tuned for more to come, and happy learning on your own data analytics discovery journeys in the New Year! Please feel free to reach out with any additional questions or feedback on the topics covered in this Spotify series. Looking forward to hearing from you! Happy Learning!!

References

Spotify Libraries & Documents

Spotipy Public Git Repositories

Spotipy JS Wrapper built by JM Perez

Spotipy R package contributions by Mia

Pandas Explode Function

Pandas Data-Driven Analysis

Tons of Medium blogs