# The 8 Key Aspects of Data Science and Machine Learning in the Internet of Things

The biggest challenge for IoT and the “digital enterprise” is turning the huge volumes of data they generate into information and then analyzing that information to improve business processes: “from sensor to insight to action,” as shown in Figure 1.

Data science is an interdisciplinary field concerned with extracting knowledge or insights from data. It is an inclusive term for quantitative methods such as statistics, operations research, data mining, and machine learning; Figure 2 shows these disciplines along with the main analyses of each. Synonymous terms include knowledge discovery and predictive analytics. The key point is that they all share the same objective: improving business processes and, more generally, any process, through quantitative analysis.

Over the years, several standards have been proposed for this process, the main one being CRISP-DM (Cross-Industry Standard Process for Data Mining). However, our experience in predictive maintenance projects, and in IoT in general, has led us to expand on the standard by adding two stages that we believe are missing: more explicit involvement of business domain expertise, and ongoing monitoring once the results of the analysis are deployed in business processes. Figure 3 summarizes SAP’s process for data science, which we have developed through many IoT predictive projects.

**1. PREDICTIVE ENGINES**

To meet the requirements of data scientists, we need to provide a comprehensive range of analyses and algorithms that also scale to huge volumes of data. The R integration for the SAP HANA® platform addresses the first requirement by giving access to the extensive R open source algorithm library and its data visualizations.

- The R Language for Statistical Computation and Graphics
- Predictive Analysis Library in SAP HANA
- Streaming Analytics with SAP HANA Smart Data Streaming and the PAL
- Automated Data Science with the Automated Predictive Library in SAP HANA
- Comprehensive and Scalable, Automatic and Expert

**2. DATA VISUALIZATION**

Visualizing large data volumes is a challenge. A scatter plot of a million data points may appear as a solid rectangle. We need visual interactivity and smarter representations of numerous data points. For example, Figure 9 shows a scatter plot of a large data set from an SAP IoT predictive maintenance project, followed by the same data as a hexbin plot in which color gradation represents data volume and from which a pattern can be discerned.
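The idea behind a hexbin plot can be sketched in a few lines: each point is assigned to a hexagonal cell, and the per-cell count is what the color scale then encodes. The following is a minimal illustration using only the Python standard library; the function names, the cell size, and the synthetic Gaussian data are assumptions made for this sketch and have nothing to do with the project data behind Figure 9.

```python
import random
from collections import Counter
from math import sqrt


def hex_cell(x, y, size):
    """Return the axial (q, r) index of the pointy-top hexagon containing (x, y)."""
    # Fractional axial coordinates of the point.
    q = (sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    # Cube rounding: round all three cube coordinates, then repair the one
    # with the largest rounding error so that they still sum to zero.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dz >= dy:
        rz = -rx - ry
    return rx, rz


def hexbin_counts(points, size):
    """Count how many points fall into each hexagonal cell."""
    return Counter(hex_cell(x, y, size) for x, y in points)


# Synthetic data (an assumption for illustration): 10,000 Gaussian points.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10_000)]
counts = hexbin_counts(points, size=0.25)

# A plotting layer would now draw one hexagon per cell, colored by its count.
densest_cell, densest_count = counts.most_common(1)[0]
print(len(points), "points reduced to", len(counts), "cells")
```

The reduction from individual points to a few hundred counted cells is what makes the pattern visible where the raw scatter plot saturates.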

**3. GEOSPATIAL DATA ANALYSIS**

Our natural understanding of the world comes through spatial analysis: mapping where things are and seeing how they relate. SAP HANA includes a multilayered spatial engine that supports spatial columns, spatial access methods, and spatial reference systems, delivering high performance in everything from modeling and storage to analysis and presentation of spatial data. With these enhanced geographical information system features, SAP HANA provides a common database for both business and spatial data. The spatial edition of SAP HANA also supports spatial clustering of a set of geospatial points using the grid, k-means, and DBSCAN (density-based spatial clustering of applications with noise) algorithms.
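To make the DBSCAN idea concrete, here is a minimal, generic sketch in plain Python. It is not SAP HANA's implementation or API: it assumes planar coordinates with Euclidean distance (real geospatial data would need a distance appropriate to its spatial reference system), and the parameter names `eps` and `min_pts` are simply the conventional ones for this algorithm.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force range query; a spatial index would replace this at scale.
        return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep growing
                queue.extend(j_neighbors)
    return labels


# Two dense groups of (hypothetical) asset locations and one isolated point.
locations = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11),
             (50, 50)]
labels = dbscan(locations, eps=1.5, min_pts=3)
print(labels)
```

Unlike k-means, this approach does not need the number of clusters in advance and leaves isolated points unassigned, which is often useful for sparse asset locations.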

**4. SERIES DATA PROCESSING**

When monitoring machine efficiency, energy consumption, or network flow, the ability to monitor data over time enables you to investigate and act on patterns in the series data. SAP HANA supports series data processing to enable efficient processing of large volumes of series data in conjunction with business data to assess business impact. This is critical functionality for IoT and predictive maintenance applications in which series data volumes are huge.
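A typical first step with sensor series is to bring irregular readings onto equidistant intervals so that they can be compared with business data recorded per period. The sketch below shows that step generically in Python; the function name, the `(timestamp, value)` layout, and the sample readings are assumptions for illustration and do not represent SAP HANA's series data functions.

```python
from collections import defaultdict
from statistics import mean


def resample_mean(readings, interval_s):
    """Average (timestamp_seconds, value) readings per fixed-width interval.

    Returns a dict mapping each interval's start time to the mean value
    of the readings that fall inside it.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts - ts % interval_s].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical energy readings arriving at irregular times (seconds, kW).
readings = [(0, 10.0), (30, 20.0), (60, 30.0), (95, 50.0)]
per_minute = resample_mean(readings, interval_s=60)
print(per_minute)
```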

**5. UNSTRUCTURED DATA ANALYSIS**

Data science is mainly associated with structured data analysis, in other words, the analysis of data organized as variables (columns) by records (rows). However, a huge amount of data exists in unstructured formats such as documents, e-mails, and blogs. Because this content is generally textual, the term “text analysis” is used for analyzing it. It is said that 80% to 90% of enterprise-relevant information originates in unstructured data residing inside or outside an organization, such as in blogs, forum postings, social media, wikis, e-mails, contact-center notes, surveys, service entries, warranty claims, and so on.

SAP HANA supports text-search, text-analysis, and text-mining functionality for unstructured text sources. It supports full-text and fuzzy search using a full-text index to preprocess text linguistically, using techniques such as normalization, tokenization, word stemming, and part-of-speech tagging.
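The preprocessing steps named above can be illustrated with a small, generic Python sketch: normalization, tokenization, a deliberately naive suffix-stripping stemmer, and a fuzzy lookup against an inverted index. Everything here is an assumption for illustration (the function names, the stemming rules, the sample service notes, and the use of `difflib` for fuzzy matching); it does not reflect how the SAP HANA full-text index works internally.

```python
import re
import unicodedata
from difflib import get_close_matches


def normalize(text):
    """Lowercase and strip accents so that 'Défaut' and 'defaut' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split normalized text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", normalize(text))


def naive_stem(token):
    """Strip a few English suffixes; a real stemmer is far more careful."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(documents):
    """Map each stem to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index.setdefault(naive_stem(token), set()).add(doc_id)
    return index


def fuzzy_search(index, query, cutoff=0.7):
    """Return ids of documents whose stems are similar to the query term."""
    stem = naive_stem(normalize(query))
    hits = set()
    for match in get_close_matches(stem, list(index), n=3, cutoff=cutoff):
        hits |= index[match]
    return hits


# Hypothetical service notes.
notes = {
    1: "Pump housing was leaking after restart",
    2: "Replaced bearing; no leaks observed",
    3: "Customer reported vibration",
}
index = build_index(notes)
print(fuzzy_search(index, "leek"))  # a misspelled query still finds the leak notes
```

Because both the documents and the query pass through the same normalization and stemming, "leaking" and "leaks" land on the same index entry, and the fuzzy step tolerates the misspelling.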

**6. SIMULATION – DETERMINISTIC AND PROBABILISTIC, AND OPTIMIZATION**

Simulation can take the form of deterministic modeling, whereby specific data values are used to model processes or operations and sensitivity analysis or what-if analysis is used to explore the inherent uncertainty in the data. Simulation in the form of probabilistic modeling explores the uncertainty by assigning probability distributions to the input data for a model and calculating the probability distributions for the output variables. For example, in a capital investment appraisal, you can estimate the probability of achieving specific net present values or discounted cash flow yields.
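The capital investment example can be sketched as a small Monte Carlo simulation: yearly cash inflows are drawn from a probability distribution, the net present value (NPV) is computed for each draw, and the share of draws above a threshold estimates the probability of reaching it. The normal distribution, the figures, and the function names below are assumptions chosen for illustration, not values from any SAP project or library.

```python
import random


def npv(rate, cash_flows):
    """Net present value; cash_flows[0] occurs today (year 0)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def simulate_npv(initial_outlay, mean_inflow, sd_inflow, years, rate,
                 runs=10_000, seed=42):
    """Sample NPVs with yearly inflows drawn from a normal distribution."""
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        inflows = [rng.gauss(mean_inflow, sd_inflow) for _ in range(years)]
        samples.append(npv(rate, [-initial_outlay] + inflows))
    return samples


def probability_at_least(samples, threshold):
    """Estimated probability that the output variable reaches the threshold."""
    return sum(s >= threshold for s in samples) / len(samples)


# Hypothetical project: outlay 100, uncertain inflows of about 30 per year.
samples = simulate_npv(initial_outlay=100.0, mean_inflow=30.0,
                       sd_inflow=10.0, years=5, rate=0.08)
p_positive = probability_at_least(samples, 0.0)
print(f"Estimated P(NPV >= 0) = {p_positive:.2f}")
```

Setting `sd_inflow` to zero collapses the model to the deterministic case, and rerunning it with different fixed inflows corresponds to the what-if analysis described above.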

Optimization may be used to determine the overall optimal capital investment program subject to constraints such as the total amount invested and the required individual project investment levels, for example, for maintenance. Both simulation and optimization are supported in SAP HANA through application function libraries.
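A toy version of this capital budgeting problem can be written as an exhaustive search over project subsets. The sketch below assumes that each project is either funded in full or not at all and models the "required investment" constraint as a set of mandatory projects; the project names and figures are hypothetical, and a real program with many projects would use an integer programming solver rather than enumeration.

```python
from itertools import combinations


def best_program(projects, budget, mandatory=frozenset()):
    """Pick the subset of projects with the highest total NPV within budget.

    projects maps name -> (cost, npv). Mandatory projects are always funded.
    Returns (total_npv, chosen_names), or None if no feasible program exists.
    """
    optional = [name for name in projects if name not in mandatory]
    best = None
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            chosen = set(mandatory) | set(extra)
            cost = sum(projects[name][0] for name in chosen)
            if cost > budget:
                continue  # violates the total investment constraint
            value = sum(projects[name][1] for name in chosen)
            if best is None or value > best[0]:
                best = (value, chosen)
    return best


# Hypothetical candidate projects: name -> (cost, expected NPV).
projects = {
    "line_upgrade": (50, 30),
    "new_sensor_fleet": (40, 25),
    "warehouse": (30, 10),
    "maintenance": (20, 2),
}
unconstrained = best_program(projects, budget=100)
with_maintenance = best_program(projects, budget=100, mandatory={"maintenance"})
print(unconstrained, with_maintenance)
```

Comparing the two results shows how a required maintenance investment changes the optimal program even though its own NPV is small.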

**7. DEEP LEARNING ON SENSOR DATA**

The relevance of deep learning to IoT comes from the huge volumes of data generated by sensors. Applications include image recognition, speech recognition, and robotics and motor control. Deep learning has been described as a set of algorithms that “mimics the brain” and is equated to neural networks that “learn in layers.”

**8. EDGE COMPUTING**

Computing on the edge is very important when a very quick response from the system is required. In the automotive area, for instance, the interaction with a navigation system has to be very quick when the data science component is trying to optimize fuel consumption by taking the driving style into account. Edge computing is also required when the data volume generated on-site is so large that a central application, which also handles incoming streams from other sources, cannot provide the throughput needed to process it. This may be the case in scenarios in which high-resolution images or videos need to be analyzed.