Big Data – Black Data
The promise of Big Data analytics is to make predictions. To turn data into action. The engine of a jetliner shows vibrations and hence it is pulled for an early check. The usage pattern of a customer unveils new sales opportunities.
But this requires the data to contain information about the root cause, else the data correlations are arbitrary. How to get that? What is the data about the root cause when the entire goal is to find the root cause?
This diagram shows that out of 100’000 inhabitants 170 male, 95 female died of heart diseases in 2015.
This data is perfect, unambiguous and we can learn from it. Men are twice as prone to heart diseases, almost three times to lung cancer. Fine.
Does not help us to expand the live span. Without knowing the meaning of male/female one way to live longer is to become female, as the data-set clearly shows. An correct conclusion but not side effect free, is it?
This type of issue shall be called:
What is variable? (influence-able)
Things like that can occur in various flavors.
- In the super market products in the upper third are sold more often then the ones close to the floor. Idea! Let’s position all goods in the shelf’s optimal level.
- In Q4 the sales revenue with toys is higher. Idea! Let’s call every quarter a Q4 quarter.
It all sounds like funny anecdotes, unfortunately it happens to all of us. In today’s world the dependencies are so high and it is easy to overlook a factor. Example? If we increase the pay of all employees in the EU by 10%, we all would have more money to buy more things. Is this true or false and why?
Cause or effect?
Okay, male do die three times more often of lung cancer. Why? Maybe women have a superior genome? Maybe men are smoking three times more? Maybe it is the type of work? Maybe women have healthier life styles?
One thing is obvious: If the provided data set does not include data related the root cause, then either no pattern can be uncovered and hence no action or worse, conclusions are backed by data although they are wrong. The raw data has to contain the root cause, we just do not know what the root cause is at the moment.
So for a proper analysis of the lung cancer area, smoking habits and similar additional factors would be needed in the data.
- A patient called Helena has a lower mortality rate then a patient called Hans. Is the first name cause or effect?
- Patients die in a hospital more often than at home. Is the location cause or effect?
- The jet engine shows vibrations. Is the engine the cause of the sensor reading or is the airplane’s wing vibration creating the sensor readout?
One important root cause is the human motivation. Example: Recently I looked for a camera at Amazon, what was the motivation for that?
- I was interested in buying it.
- I was interested in the technical specs only.
- I was curious if other customers criticize the same things.
- I opened the page by accident.
- I already decided to buy a Sony camera but compared the two for confirmation.
- I was buying the camera but not for myself.
The computer has little chance to distinguish between those. Hence human motivation is the cause of an action but the data only shows the effects. That is a problem.
In the 2018 series Westworld, the Delos Corporation operates an amusement park, populated by human-like AIs. Only later we find out that the actual business of Delos is to create AI surrogates of the park’s guest, to know their motivations, their desires, to predict their decisions in the real-life for Delos’ own business benefits.
We are not there yet. A good starting point is a 360° view on the customer.
The problem of a 360° view is the lack of competition’s data. Although Amazon is pretty large, not everything is bought via it. My Nikon camera was bought in another shop and I looked at Amazon reviews and ordered accessories. No chance for Amazon to get the information I bought the camera elsewhere. Not even if they provide a form of some kind to enter the data.
Two options spring to mind:
- Add market data. Then a revenue increase of 20% can be put into relation to the market condition, e.g. the market grew by 200% at the same time? Revenue growth is not good.
- Partner with the competition. Amazon marketplace is a good example for that. If Amazon does not make the revenue then at least they get a provision and still know about the sales.
An important example of partnering is using agencies. One internet app allows people to order food from various restaurants. Hence this internet app has the big picture, knows the competitive situation, the trends, price development, the customers etc. (In this example not for the benefit of the restaurants because they do not hold shares on the app. They are just the provider.)
Hence the suggestion, do not give away this chance. Create such application in a separate legal entity together with others. Else the most valuable information is owned by somebody else, the agency.
Getting into somebody’s head – the black box
The best option is to get the information from the customer directly. Social networks are king at that. For whatever reasons people are sharing information with the social network owners freely.
How can this be translated into other areas? Well, by following a few guidelines. The same guidelines SAP is using for its product development
- Customer first: Whatever is provided to the user, it has to have a clear value for him. A significant at that. Otherwise he will not use it.
- Run simple: The counterpart of above “value” is the cost. How difficult is it to provide the data. Ideally it should be unobtrusive.
- Gamification: If a user has no choice, e.g. a sales order has to be created in SAP ERP – there is no other way, he is forced to do so. Motivating individuals to share their desires freely is not of that type. Hence providing that data has to be fun, motivational, have competitive twist?
- Above cause-of-death data should be correlated with sport activities. One option would be to ask the relatives. Result will be pretty useless. Fitness trackers on the other hand follow all named principles. If only they are used by many and would be available for doctors – :sigh:.
- The app stores know and interact with the customers, the companies providing apps are played against each other.
- A grocery shopping list application per supermarket chain will not be successful. One app for all creates a win-win situation. Helps the user, is simpler to use and develop than many, can have aspects of gamification (“The top rated reviewers of each month get … for free”).
Looking at current data is the first level. Operational Reporting is the tool space.
Looking at the complete history level #2. Data Warehouse is the related term.
Extrapolating from the past changes into the future is level #3. Predictive tools.
Uncovering hidden information via Data Scientist methods (Statistics, Machine Learning, Predictive) is the competitive battleground, the differentiation between companies today. But too often this would need more than simply dump sensor data, it needs more data, data that might not even be available today. Hence we need to find ways to get this information if the project should be really successful.