Hands-On Tutorial: SAP Smart Predict, Used Car Pricing
Today’s used car dealers face two pricing challenges. On the one hand, the automotive industry offers many models with many configuration options, and not every equipment detail increases the resale value of a used car. On the other hand, Internet portals make pricing more transparent. A used car dealer must therefore quickly find a reasonable price for an item that is hard to compare because of its configuration variety. Many dealers fall back on well-known indicators such as age, number of previous owners, accident history and the like.
But couldn’t modern statistical methods, which learn from past prices to estimate the achievable price, help with pricing by valuing more attributes than just the classical ones? And ideally without the need for scarce, highly paid specialists, so-called data scientists, who would be under-challenged by such a standard statistical problem.
With Smart Predict, SAP Analytics Cloud provides a statistics component that lets business users solve standard statistical problems themselves. This frees data scientists from routine tasks and gives them more time for the heavy lifting.
This guide walks us through a price forecast analysis for a used car dealer or web shop in SAP Analytics Cloud.
First, let’s have a look at the data of our car dealer. For this tutorial I simulated the data, so it might not be applicable to real cars. It describes the detailed configuration of each car and the price it achieved. Based on this data we want to predict the price of each car that is for sale. To find patterns in the historic data and to train the model, the following data set is used:
OfferID | Unique ID for each car offer |
Shop | Flag whether the car was sold via e-Shop or car dealer |
Price | The price at which the car was sold |
vehicleType | Describes the type of the vehicle, e.g. limousine or sportsbag |
gearbox | Flag whether the gearbox is automatic or manual |
powerPS | The power of the car in PS |
model | The model of the car |
kilometer | The mileage of the car in kilometers |
fuelType | Describes the fuel type of the car |
notRepairedDamage | Flag whether the car has unrepaired damage or not |
postalCode | Describes the postal code of the car dealer and is empty if the car is sold via e-shop |
stylePackagePremiumLine | Flag whether the car design is “Premium Line” or not |
stylePackageSLine | Flag whether the car design is “Sports Line” or not |
backupCamera | Flag whether the car has a backup camera or not |
GPS | Flag whether a GPS system is included or not |
VoiceControl | Flag whether the car has a voice control or not |
SportSteeringWheel | Flag whether the car has a sport steering wheel or not |
airCondition | Flag whether the car has air condition or not |
SportSeats | Flag whether the car has sport seats or not |
adjustableSteeringWheel | Flag whether the car has an adjustable steering wheel or not |
KeyLessGo | Flag whether the car has keyless go or not |
RearSeatHeating | Flag whether the car has rear seat heating or not |
warranty | Flag whether the car is sold with warranty or not |
effectPaint2 | Flag whether the car has this certain kind of effect paint or not |
heatableMirrors | Flag whether the car has heatable mirrors or not |
stylePackage1 | Flag whether the car has this certain kind of style package or not |
stylePackage2 | Flag whether the car has this certain kind of style package or not |
leatherSeats | Flag whether the car has leather seats or not |
InteriorPremium | Flag whether the car has the premium package for the interior design or not |
InteriorSport | Flag whether the car has the sports package for the interior design or not |
peculiarity_drive_assistant_systems | Describes which kind of peculiar drive assistant systems the car has from zero to 5; the higher the number the more extensive are the drive assistant systems |
effectPaint1 | Flag whether the car has this certain kind of effect paint or not |
Businesspackage | Flag whether the business package is included or not |
heatableSteeringWheel | Flag whether the car has a heatable steering wheel or not |
peculiarity_interior | Describes which kind of peculiar interior design the car has from zero to 5; the higher the number the more exclusive the design |
Age | The age of the car |
Log on to an SAP Analytics Cloud instance.
After logging on, we need to upload the dataset. To do this, we open the menu at the top left, select “Create” and click “Dataset”.
In the pop-up, we select the source file “used_car_pricing.csv”, click “Import” and then “OK”.
Now that we have uploaded the dataset, we can start to build our predictive scenario. We select “… More” and then “Predictive Scenario” in the menu.
A predictive scenario is a set of use cases with common characteristics. SAP Analytics Cloud’s Smart Predict currently offers three types of predictive scenarios:
- Classification scenarios predict the value of a (target) variable that can only have two values, like yes and no or 0 and 1. Examples of classification scenarios are:
  - Customer churn, with the target variable predicting whether a customer will leave or not
  - Propensity to buy, with the target variable predicting whether a customer will buy a product offered to them or not
  - Fraud, with the target variable indicating whether a transaction or claim was fraudulent or not
- Regression scenarios predict the numerical value of a target variable depending on variables describing it. Examples of regression scenarios are the prediction of:
  - The number of customers visiting a shop during lunch time
  - The revenue of a customer in the next quarter
  - The sales price of a used car
- Time Series scenarios predict the value of a variable over time, taking into account further descriptive variables. Examples of time series scenarios are the prediction of:
  - Revenue for a product line over the next few quarters
  - The number of bicycles hired in a city over the next few days
  - Travel expenses in the next few months
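As a rough illustration of how the three scenario types differ, the rule of thumb above can be written down as a small Python function. This is only a sketch: `suggest_scenario` and its `forecast_over_time` parameter are hypothetical names invented here and are not part of SAP Analytics Cloud, where the user picks the scenario type manually.

```python
from typing import Hashable, Iterable


def suggest_scenario(target_values: Iterable[Hashable], forecast_over_time: bool = False) -> str:
    """Rule-of-thumb mapping from a target variable to a scenario type.

    Hypothetical helper for illustration only; not an SAP Analytics Cloud API.
    """
    if forecast_over_time:
        # The target is to be forecast along a date dimension.
        return "Time Series"
    distinct = set(target_values)
    if len(distinct) == 2:
        # Exactly two possible outcomes, e.g. yes/no or 0/1.
        return "Classification"
    if distinct and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    ):
        # A numerical target with many possible values.
        return "Regression"
    raise ValueError("target does not match any of the three scenario types")


print(suggest_scenario(["yes", "no", "no", "yes"]))   # churn flag
print(suggest_scenario([12500.0, 8900.0, 23100.0]))   # used car price
print(suggest_scenario([410.0, 395.5, 402.0], forecast_over_time=True))  # travel expenses
```

For the used car data, the target is a numerical price without a time dimension, so the function lands on the regression scenario.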
The user now has to follow three simple steps, illustrated here with a customer churn use case:
- Choose the predictive scenario that matches the use case.
- Train the model with historic data, i.e. use a data set where the customer behavior (has churned or not) is known. The statistical algorithm will “learn” from this data set, i.e. find patterns that characterize a customer who is likely to churn. There should be enough positive cases (i.e. churned customers) in the training data set.
- Apply the model to a new data set, i.e. use a data set where the customer behavior is unknown. The statistical algorithm will apply the patterns learnt in the previous step to the new data and identify the customers who are likely to churn.
In our used car example, the variable that contains the price information in the learning phase and is predicted in the application phase is called the target variable.
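The split between a training phase and an application phase can be sketched in a few lines of Python. The example below assumes a deliberately simple stand-in model, a straight line fitted by ordinary least squares on a single variable (age), and uses made-up numbers. It does not reproduce the algorithm Smart Predict uses internally; it only shows that the target is known during training and unknown during application.

```python
def train(ages, prices):
    """Fit price = intercept + slope * age by ordinary least squares."""
    n = len(ages)
    mean_age = sum(ages) / n
    mean_price = sum(prices) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (p - mean_price) for a, p in zip(ages, prices))
    slope = sxy / sxx
    intercept = mean_price - slope * mean_age
    return intercept, slope


def apply_model(model, ages):
    """Use the fitted line to predict prices for cars whose price is unknown."""
    intercept, slope = model
    return [intercept + slope * a for a in ages]


# Training data: the target (price) is known. Made-up numbers.
train_ages = [1, 2, 3, 4, 5]
train_prices = [20000, 18000, 16000, 14000, 12000]
model = train(train_ages, train_prices)

# Application data: only the age is known, the price is predicted.
predicted = apply_model(model, [6, 7])
print(predicted)
```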
The following screenshot shows the three options: classification, regression and time series. Under each option there is a description to make it easier for the user to select the right scenario for each use case. In this exercise, we want to create a model to predict the price of each car. Based on the descriptions of the predictive scenario types, we can see that a regression addresses our needs. So, we select “Regression”.
In the pop-up window, we give the scenario a name, e.g. “Used Car Pricing”, and save it in our folder.
Now we can create our Predictive Model.
We will need to select an input dataset for our model. The input data set contains historical data that we use to train the predictive model.
We select the Used_Car_Pricing data set we created a few steps before.
After selecting the input data, let’s have a look at the variable details. We click “Edit Column Details” directly below the field where we selected the input data.
We check that all storage and data types of the variables were recognized correctly and make sure that the target variable’s type is continuous, as in the screenshot below.
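To see what such a type check amounts to, here is a minimal Python sketch that guesses a type for each column of a small CSV sample. The three categories ("continuous", "flag", "nominal") and the function `infer_type` are simplifications chosen for this illustration, and the sample rows are made up; Smart Predict performs its own detection with its own set of storage and data types.

```python
import csv
import io


def infer_type(values):
    """Classify a column as 'continuous', 'flag' or 'nominal' from its values."""
    non_empty = [v for v in values if v != ""]
    if len(set(non_empty)) <= 2:
        return "flag"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        # At least one value is not a number.
        return "nominal"
    return "continuous"


# Made-up sample rows using a few of the column names from the data set.
sample = io.StringIO(
    "OfferID,Price,gearbox,powerPS,model,GPS\n"
    "1,12500,manual,110,Best Drive Car1,1\n"
    "2,8900,automatic,75,Best Drive Car2,0\n"
    "3,23100,manual,190,Best Drive Car3,1\n"
)
rows = list(csv.DictReader(sample))
column_types = {name: infer_type([r[name] for r in rows]) for name in rows[0]}
for name, kind in column_types.items():
    print(f"{name}: {kind}")
```

Note that a purely numeric identifier such as OfferID looks continuous to a naive check like this one, which is one reason why a manual review of the column details is worthwhile.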
After all variable metadata is defined correctly, we need to select the variable roles:
The target variable is the variable that we try to predict after the learning phase. In our example we select “price” as the target variable.
Variables that have no influence on the target can be excluded from the modeling process. Excluding variables can speed up execution, but keeping them does not harm the model. In our example, OfferID has no influence on the target because it is assigned arbitrarily to each car, so we exclude it as shown on the next screen.
However, we must exclude variables that are directly related to the target variable, such as transformations of the target and variables that indirectly contain the same information. For example, if a dataset contains the car’s price with and without taxes, we need to select one as the target and exclude the other.
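One simple way to spot such variables before training is to look for numeric columns that are almost perfectly correlated with the target. The Python sketch below assumes a hypothetical column `PriceInclTax` (the price plus 19% tax) next to the real columns Price and kilometer, with made-up values. The helper names and the threshold of 0.999 are illustrative choices, not something Smart Predict exposes.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def leakage_candidates(columns, target, threshold=0.999):
    """Names of columns that are almost perfectly correlated with the target."""
    return [
        name
        for name, values in columns.items()
        if name != target and abs(pearson(values, columns[target])) >= threshold
    ]


columns = {
    "Price": [12500.0, 8900.0, 23100.0, 15400.0],
    # Hypothetical column: the same price including 19% tax.
    "PriceInclTax": [14875.0, 10591.0, 27489.0, 18326.0],
    "kilometer": [60000.0, 150000.0, 20000.0, 90000.0],
}
suspects = leakage_candidates(columns, target="Price")
print(suspects)
```

A check like this only catches linear transformations of the target in numeric columns with some variation (a constant column would cause a division by zero here); business knowledge about the data is still needed for the rest.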
Now our settings look like this:
We keep the default settings for the predictive model and click “Train” at the bottom right.
After the model has been trained, we can select version one. We see two performance indicators that describe the quality of the model. The mean squared error tells us how close the predicted values of the regression model are to the real values: it is the average of the squared differences between the predicted and the real values, so smaller values indicate a better fit.
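The calculation described above is short enough to write out. The Python sketch below uses four made-up cars and also prints the square root of the result, which is expressed in the same unit as the price and is therefore often easier to interpret. The variable names are chosen for this illustration only.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up actual and predicted prices for four cars.
actual = [12000.0, 9000.0, 23000.0, 15000.0]
predicted = [12500.0, 8500.0, 22000.0, 15000.0]

mse = mean_squared_error(actual, predicted)
rmse = math.sqrt(mse)  # same unit as the price
print(mse, round(rmse, 2))
```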
The Prediction Confidence (KR) shows the robustness of the model. It describes the capacity of the model to achieve the same performance when it is applied to a new dataset exhibiting the same characteristics as the training dataset.
This chart shows the names of all variables that the model generation process identified as relevant for the target and orders them by their impact on the target variable.
- This chart displays the Predicted values vs. the Actual values.
- The perfect model, where Predicted = Actual (X=Y) is the green curve (Wizard).
- The model is the blue curve.
- The blue area shows where about 70% of the actual values are expected to be. It is a confidence interval around the prediction, and its width is twice the standard deviation of the target values. In other words, if we assume a Gaussian distribution of the actual values, about 70% of them (roughly 68% for one standard deviation on either side) should fall in the blue area. Keep in mind that this is a theoretical percentage that may not be observed every time.
- About 20 segments or bins of predicted values are built. Each of these segments represents roughly 5% of the population.
- For each of these segments, some basic statistics are computed on the actual values:
  - Segment Mean
  - Target Mean
  - Target Variance (with Target Deviation = sqrt(Target Variance))
- A dot on the graph corresponds to the segment mean on the X-axis, and the target mean on the Y-axis.
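The segment statistics behind the dots can be approximated with a short Python sketch. It assumes that the segment mean is the mean of the predicted values in a segment and the target mean is the mean of the corresponding actual values, which is consistent with the axes described above. The function name, the equal-sized cut and the sample numbers are assumptions made for this illustration; the exact binning in Smart Predict may differ.

```python
from statistics import mean, pvariance


def segment_stats(predicted, actual, n_segments=20):
    """Sort by predicted value, cut into equal-sized segments, summarise each."""
    pairs = sorted(zip(predicted, actual))
    n = len(pairs)
    stats = []
    for i in range(n_segments):
        chunk = pairs[i * n // n_segments:(i + 1) * n // n_segments]
        if not chunk:
            continue  # fewer rows than segments
        preds = [p for p, _ in chunk]
        acts = [a for _, a in chunk]
        variance = pvariance(acts)
        stats.append({
            "segment_mean": mean(preds),         # X coordinate of the dot
            "target_mean": mean(acts),           # Y coordinate of the dot
            "target_variance": variance,
            "target_deviation": variance ** 0.5,
        })
    return stats


# 40 made-up cars: each actual price is 200 above or below its prediction.
predicted = [1000.0 * i for i in range(1, 41)]
actual = [p + 200.0 if i % 2 == 0 else p - 200.0 for i, p in enumerate(predicted)]

stats = segment_stats(predicted, actual)
print(len(stats), stats[0])
```

With 40 rows and 20 segments, each segment holds two cars, i.e. 5% of the population, matching the description in the list above.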
After examining this curve we can now have a closer look at the influencing factors.
We can see more details for each contributing variable. If we select, for example, “Age” in the drop-down box shown in the next figure, we see that the variable’s values have been grouped. Each member of a group has a similar effect on the target, either influencing it positively (the top bars in the bar chart) or negatively (the bottom bars in the chart).
In our example, cars between 3 and 7 years old are more likely to achieve a higher price not only than older cars (which we would expect) but also than younger cars. Maybe cars between 3 and 7 years old are well within most people’s budget and still new enough not to cause too much trouble, and are therefore considered a good deal. Looking at the details of variable contribution and trying to explain customer behavior could lead to new business insights and better marketing and sales efforts for the cars.
We can repeat this analysis for other influencing factors, for example “model”. In the variable field we can search for the variable by name instead of always selecting it manually.
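A very rough way to reproduce this kind of view outside the tool is to compare the average price of each group with the overall average. The Python sketch below does exactly that on made-up numbers that mirror the pattern described above. It is a simplification: it ignores all other variables, the age groups are assumed to be given, and it is not how Smart Predict computes its grouped influences.

```python
from collections import defaultdict


def group_effects(categories, prices):
    """Average price per category minus the overall average price."""
    overall = sum(prices) / len(prices)
    by_group = defaultdict(list)
    for category, price in zip(categories, prices):
        by_group[category].append(price)
    effects = {g: sum(v) / len(v) - overall for g, v in by_group.items()}
    # Most positive influence first, as in the bar chart.
    return dict(sorted(effects.items(), key=lambda kv: kv[1], reverse=True))


# Made-up age groups (in years) and prices.
age_groups = ["0-2", "0-2", "3-7", "3-7", "8+", "8+"]
prices = [15000.0, 17000.0, 19000.0, 21000.0, 8000.0, 10000.0]

effects = group_effects(age_groups, prices)
for group, effect in effects.items():
    print(f"{group}: {effect:+.0f}")
```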
Here we see that the customers have a clear preference for “Best Drive Car3”.
Now that we have taken a deeper look at the model and are convinced of its quality, we can apply it. We click the little factory icon at the top left, use “used_car_pricing” (the new data) as the input data set and choose a name like “used_car_pricing_result” for the output data set. As input variables we select “All Variables”, and as prediction we choose “Predicted Value”.
We can add further statistical information by selecting “Assigned Bin” under “Statistics”. The predicted values are then sorted in descending order and split into 10 bins containing 10% of the values each. Bin 1 contains the 10% of cars with the highest predicted prices, and so on down to Bin 10 with the lowest predicted prices. This tells us the price group of each car.
Now we can select “OK”.
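The bin assignment described above can be sketched in Python as follows. The function `assign_bins` is a hypothetical helper written for this illustration; in particular, how ties between equal predictions are broken is an assumption and may differ from what Smart Predict does.

```python
def assign_bins(predicted, n_bins=10):
    """Return a bin number per value: 1 = highest predictions, n_bins = lowest."""
    # Indices of the cars, ordered from highest to lowest predicted price.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    bins = [0] * len(predicted)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(predicted) + 1
    return bins


# 20 made-up predicted prices: 1000, 2000, ..., 20000.
predicted = [1000.0 * i for i in range(1, 21)]
bins = assign_bins(predicted)
print(list(zip(predicted, bins)))
```

With 20 cars and 10 bins, each bin receives two cars, so the two most expensive cars land in bin 1 and the two cheapest in bin 10.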
At the bottom we can see that the model is being applied. When it has finished, we have a look at the results.
We navigate to the menu and select “Files”.
We navigate to the folder where we saved our output dataset, click on the dataset and scroll to the far right.
On the far right we see the column “rr_price”, which contains the predicted price, and the column “quantile_rr_pice_10”, which contains the assigned bins.
While it is great to see this information for each individual offer, it may be easier for us to consume this information through visualizations. For visualizing the result in an SAP Analytics Cloud Story, we first need to build a model on the dataset.
We navigate to the menu and select “Modeler” followed by “From a Data Source”.
In the pop-up we select “Dataset”.
We navigate to the folder where we saved our output dataset and select the dataset.
Click “Create Model” in the bottom right and select “Create”.
Tip: Before you create the model, rename the columns rr_price and quantile_rr_pice_10 to something that business users easily understand, for example Price Forecast and Price Category.
Once our model has been built, we build our story on top of it.
We navigate to the menu and select “Stories” followed by “Canvas”.
We choose a visualization element, for example a table. We are then asked to choose a model and select the model we just built from the folder it is stored in.
In the table we want to show the forecasted price and the bin of the price for each car. To build the structure of the table, we select “Rows” and then “Offer_id”.
Under “Columns”, we hover on “Account” and move our mouse to the right. We should see that a filter button appears. We select the measures we want to show in our table (i.e. rr_price and quantile_rr_pice_10) and deselect the other measures. Then we click “OK”.
We can also sort the table by selecting a column and then clicking the arrow icon.
We can add more charts by selecting the chart icon at the top.
Be creative and add the company logo or adjust the styling to the company’s corporate identity, add more data, nice charts or an RSS feed. Just try it out.
The official SAP Analytics Cloud tutorials and the playlist are very helpful.
https://wiki.scn.sap.com/wiki/display/BOC/SAP+Analytics+Cloud+-+Official+Product+Tutorials
https://www.youtube.com/playlist?list=PLs5htBIwERYWSixKSqQHzndop33aBCz1U
Have fun doing your own predictions and building nice dashboards.
Dear Sarah,
Thank you for this well written & detailed document. I'm interested in a dataset used in this article, but I am not able to find this dataset on the Internet, can you please provide the link for the same.
Thank you.
Regards,
Vineet
Sure, you can find the Data Set here.
Hi Sarah,
Thanks for the document. I have used the historical dataset 'used_car_pricing.csv' to train the model and to apply it we need new dataset. Can you please provide this file as well.
Thanks,
K Srinivas.
Unfortunately I don't have an apply data set. You would have to apply it on this data set.
Kind Regards,
Sarah
Thanks Sarah for the Dataset...
Hi Sarah, fantastic post!! I would love the opportunity to use this as a hands-on test... would you happen to have the data set available? I noticed the link from an earlier comment was no longer active. I appreciate your time!
Hi Brian, thanks a lot. I send you the data sets via E-Mail. Let me know if you need anything else.
Hi, the dataset used in this blog can be downloaded here
Best regards
Antoine