Big Data and Data Science in Action – SAP Webcast Recap
SAP provided this webcast today, giving a background on data science.
Figure 1: Source: SAP
A data scientist uses mathematics and IT to solve business problems, asks the right questions, and uses technical tools and programming languages.
Figure 2: Source: SAP
How does data science differ from BI? The SAP speaker said BI defines standard reporting functionality, while data science contains a math component.
The maturity model shown in Figure 2 includes data mining: applying standard math methods to a dataset, such as algorithms and decision trees, to find a pattern, create clusters, or forecast a time series.
Modeling describes a business process with a causal model: identify the driving factors, devise a math formula, or use the data to fine-tune parameters.
Optimization means looking at deviations and, for example, changing safety stocks.
Figure 3: Source: SAP
Figure 3 was a quiz; the numbers are in euros.
One set of numbers is true.
The other set of numbers is made up, invented by a person.
54% of the attendees thought the left column was false (including me). Wait until the end for the “final answer”.
Figure 4: Source: SAP
Figure 4 shows a retail example of how customers buy things.
Retail generates data, for example when measuring the impact of sales promotions.
Retailers introduce new products that need to be tested; the problem is that a large number of new products fail.
A company puts two new products on the shelf and compares which one sells more or less.
In Figure 4 it looks like product A is more successful, so the company might leave product A on the shelf and remove B.
The first effect: you may buy a new flavor out of curiosity. The second effect: you have eaten it, you like it or not, and you buy it again or not.
During the first month the first effect is stronger, but the second effect is more important for keeping customers in the long term.
You want to see how people buy the product repeatedly to determine success, and ask the right questions.
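One way to look past first-month sales is to compute, per product, the share of buyers who bought more than once. The sketch below is illustrative only: the function name `repeat_rate`, the `(customer, product)` record layout, and the sample purchases are assumptions, not data from the webcast.

```python
from collections import Counter

def repeat_rate(purchases):
    """Share of each product's buyers who bought it more than once.

    purchases: iterable of (customer_id, product) tuples, one per purchase.
    """
    per_customer = Counter(purchases)  # (customer, product) -> number of purchases
    buyers = Counter()
    repeaters = Counter()
    for (customer, product), n in per_customer.items():
        buyers[product] += 1
        if n > 1:
            repeaters[product] += 1
    return {product: repeaters[product] / buyers[product] for product in buyers}

# Made-up sample: A sells more units, but nobody buys it twice.
purchases = [
    ("c1", "A"), ("c2", "A"), ("c3", "A"), ("c4", "A"), ("c5", "A"),
    ("c1", "B"), ("c1", "B"), ("c2", "B"), ("c2", "B"),
]
rates = repeat_rate(purchases)
print(rates)
```

In this made-up sample product A has more units sold, yet only product B has repeat buyers, which is the kind of difference the raw sales chart hides.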
Figure 5: Source: SAP
Figure 5 is a supply chain optimization example for a railway, which manages a large supply chain of spare parts with a complex setup: different locations for servicing trains, parts available, and broken parts.
The supply chain is managed at each location with a replenishment policy: spare parts are consumed over time until stock drops to the reorder point, which triggers replenishment. Parameters are involved.
Who says the reorder point is where it should be?
The solution simulates the future using historical information, in the form of a statistical distribution for demand.
They then optimize the parameters, reducing the reorder point so that inventory is smaller, to enable forecasting.
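To make the idea of simulating a reorder point concrete, here is a minimal sketch. It is not the SAP spare parts simulator (which the webcast said was built in Java). The function `simulate_policy`, the fixed order quantity, the lost-sales assumption, and resampling demand from a made-up history are all assumptions chosen for illustration.

```python
import random

def simulate_policy(reorder_point, order_qty, lead_time, daily_demand, days=365, seed=0):
    """Simulate a reorder-point policy; return (fill_rate, average_on_hand).

    daily_demand: function taking a random.Random and returning one day's demand.
    Unmet demand is assumed lost (an assumption of this sketch).
    """
    rng = random.Random(seed)
    on_hand = reorder_point + order_qty
    pipeline = []  # open orders as (arrival_day, quantity)
    demanded = served = stock_sum = 0
    for day in range(days):
        # Receive orders that arrive today.
        on_hand += sum(q for d, q in pipeline if d == day)
        pipeline = [(d, q) for d, q in pipeline if d > day]
        # Serve today's demand from stock.
        demand = daily_demand(rng)
        filled = min(demand, on_hand)
        on_hand -= filled
        demanded += demand
        served += filled
        # Reorder when stock plus open orders is at or below the reorder point.
        position = on_hand + sum(q for _, q in pipeline)
        if position <= reorder_point:
            pipeline.append((day + lead_time, order_qty))
        stock_sum += on_hand
    fill_rate = served / demanded if demanded else 1.0
    return fill_rate, stock_sum / days

history = [0, 0, 1, 0, 2, 0, 0, 1, 3, 0]  # made-up daily demand for one part
for rp in (1, 3, 5):
    fill, avg = simulate_policy(rp, order_qty=5, lead_time=4,
                                daily_demand=lambda rng: rng.choice(history))
    print(f"reorder point {rp}: fill rate {fill:.3f}, average stock {avg:.2f}")
```

Comparing the printed fill rate and average stock across candidate reorder points is the simplest form of the trade-off described above: a lower reorder point holds less inventory, and the simulation shows what that costs in service.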
Figure 6: Source: SAP
The next example, newspaper sales in Figure 6, combines forecasting with optimization.
If the newspaper sends too few copies it loses sales, but if it sends too many it incurs the cost of sending newspapers back.
The model determines how many newspapers to send each day.
Look at history to forecast future sales; add safety stock.
As an example, say Shop “B” in Figure 6 is a small shop next to a football stadium: on game days it sells a lot, on other days not. The model needs to take such special factors into account.
More precisely, what matters is the variability of demand.
It uses the model to optimize the number of papers to print and sell.
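The trade-off between lost sales and returns is the classic newsvendor problem. The webcast did not say which method SAP used, so the sketch below is only one textbook way to frame it: pick the smallest quantity whose empirical service level reaches the critical ratio `underage_cost / (underage_cost + overage_cost)`. The function name, the cost values, and the sales history are made up.

```python
import math

def newsvendor_quantity(history, underage_cost, overage_cost):
    """Order quantity at the empirical quantile given by the critical ratio.

    underage_cost: cost per copy of demand that could not be served (lost sale).
    overage_cost: cost per unsold copy (for example, sending it back).
    """
    ratio = underage_cost / (underage_cost + overage_cost)
    ordered = sorted(history)
    # Index of the smallest value whose cumulative share is at least `ratio`.
    k = max(math.ceil(ratio * len(ordered)) - 1, 0)
    return ordered[k]

# Made-up daily sales for a shop next to a stadium: two game days stand out.
history = [10, 12, 11, 13, 60, 12, 11, 58]
for under, over in ((1, 1), (3, 1), (7, 1)):
    q = newsvendor_quantity(history, under, over)
    print(f"lost sale costs {under}, return costs {over}: send {q}")
```

With this made-up history, the game-day volumes only drive the quantity when a lost sale is much more expensive than a return, which is why treating game days as a separate special factor is usually the better approach.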
Figure 7: Source: SAP
Another use case covered was utilities with sensor data analytics: power utilities use it for their processes.
“Before data science, get the data quality in place,” the speaker said.
The data could have millions of entries and could be incomplete.
Use data science to improve data quality:
- Look at the data and update it manually; labor intensive
- Define business rules that business experts apply to the dataset; takes time
- Use math algorithms to identify patterns in data
Combine all three approaches to improve data quality
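As a rough illustration of combining a business rule with a math-based check on sensor data, the sketch below flags missing values, values outside an allowed range, and statistical outliers. The function `flag_readings`, the range limits, the modified z-score test with threshold 3.5, and the sample readings are all assumptions, not the method shown in the webcast. Flagged entries would still go to a person for manual review.

```python
from statistics import median

def flag_readings(readings, low, high, threshold=3.5):
    """Return {index: reason} for readings that need review.

    Reasons: "missing" (None), "rule" (outside [low, high]),
    "outlier" (modified z-score based on the median absolute deviation).
    """
    present = [r for r in readings if r is not None]
    med = median(present)
    mad = median(abs(r - med) for r in present)
    flagged = {}
    for i, r in enumerate(readings):
        if r is None:
            flagged[i] = "missing"
        elif not (low <= r <= high):
            flagged[i] = "rule"      # business rule: physically impossible value
        elif mad > 0 and 0.6745 * abs(r - med) / mad > threshold:
            flagged[i] = "outlier"   # plausible range, but far from the pattern
    return flagged

# Made-up sensor readings.
readings = [20.1, 19.8, None, 20.3, -5.0, 20.0, 35.0, 19.9]
flags = flag_readings(readings, low=0.0, high=100.0)
print(flags)
```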
A data science team combines people with a math background with people who have technical and visualization skills (to hide complexity); the back end uses big data. See http://readwrite.com/2014/07/21/data-scientist-income-skills-jobs
Figure 8: Source: SAP
Figure 8 shows an SAP UI5 front end that combines a good user experience with functionality.
Figure 9: Source: SAP
Figure 9 shows how often customers buy two products at the same time, to help with promotions (does this mean orange juice is bought together with frozen bread?).
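Counting how often two products land in the same basket can be sketched in a few lines. The function `pair_counts` and the baskets below are made up for illustration; the webcast did not show how the chart in Figure 9 was computed.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Made-up baskets.
baskets = [
    ["orange juice", "frozen bread", "milk"],
    ["orange juice", "frozen bread"],
    ["milk", "eggs"],
]
counts = pair_counts(baskets)
for pair, n in counts.most_common(3):
    print(pair, n)
```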
Figure 10: Source: SAP
Figure 10 shows the “least common denominator”.
It includes BI and UI5, with UI5 combining the transactional and analytical worlds in real time, and nice-looking graphs with limited effort.
The SAP Big Data Platform includes HANA and the Sybase portfolio.
The algorithm side includes different tools: PAL in HANA or SQL algorithms, and Java for specific coding.
The spare parts simulator was built using Java.
How to start a data science project:
- Use cases workshop
- Proof of concept project
- Business case for full solution
- Working with both business and IT
Data Science Quiz – Results
Figure 11: Source: SAP
Figure 11 shows how the first digits of numbers are distributed.
When people falsify a tax statement, they make the numbers look random.
Every digit then has the same probability.
The speaker said to open Wikipedia, look at numbers that describe quantities (a count, the length of a wall), and write down the first digit of each number: 1 appears often, 2 often, 3 less often,
and 8 and 9 appear rarely (about 5% of the time each). This is Benford’s law.
The digit 1 appears as a first digit only once on the right, so the right side is not real.
The left side is real.
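Benford’s law gives the expected share of first digit d as log10(1 + 1/d), so 1 leads about 30% of the time and 9 under 5%. The sketch below compares that expectation with the first digits of a list of amounts. The function names and the sample amounts are made up and are not the figures from the quiz; a real check would need far more values.

```python
import math

def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_shares(amounts):
    """Observed share of each leading digit among non-zero whole-number amounts."""
    digits = [int(str(abs(int(a)))[0]) for a in amounts if int(a) != 0]
    return {d: digits.count(d) / len(digits) for d in range(1, 10)}

amounts = [123, 150, 19, 2400, 980]  # made-up values, not the quiz figures
expected = benford_expected()
observed = first_digit_shares(amounts)
for d in range(1, 10):
    print(d, f"expected {expected[d]:.3f}", f"observed {observed[d]:.3f}")
```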
Question & Answer:
Q: In the retail example, are there other data points besides repeat purchasing to determine whether products are more popular?
A: Visibility of shelf space, for example; these were not data points used here, but you can include them as influencing factors.
Systematically test programs.
Q: For data mining, do you use the full data set or a sample?
A: It depends on the business problem.
Q: On data quality: when you take out outliers, you need to understand where they come from. How do you assess that?
A: It depends on the business problem.
Q: Related to data quality, the example had corrected sensor data. Assuming you are not dropping wrong data, how do you ensure a single source of truth?
A: They sent uncleansed data, used data mining to cleanse it and produce a data cleansing proposition, then used field examinations and compared the datasets; where they do not agree, a method has failed.
Q: How far can you use Hadoop for data science and analysis, together with SAP HANA?
A: Connect Hadoop with HANA through a smart query layer; see the Adobe example from ASUG Annual Conference, 0404 Adobe’s Story of Integrating Hadoop … | ASUG
Q: How do you build data science skills? Is there a reading list or training?
A: It depends on where you start from (math, statistics, high performance computing). Look at data mining tools (SAP Predictive Analysis, InfiniteInsight), which are really meant to make data science accessible to the end user, and see SAP training.
_____________________________________________________________________________________________________________
For more information, SAP TechEd && d-code Las Vegas has 136 “Big Data” sessions and ASUG has 10 Big Data sessions
Monday, October 20th, ASUG will host a hands-on BI session – more to come soon.
ASUG has a Harness the Big Data Monster webcast on August 5 – register here