What is Big Data
Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. In other words, these are large or complex data sets that traditional data processing applications are not able to manage.
But how big is big data? Is an exabyte of data big data? I would say no. Big data is data whose size runs to many exabytes, and it can be structured or unstructured: videos, images, and so on. In other words, data can be anything, and anything can be data.
Data is becoming big data because of the growth of social, mobile, cloud, and multimedia computing. However, I feel that some big companies and retailers already had data on this scale but had no solution for it, because the storage available to them could not accommodate or process it.
The hype around big data started around 2015, and since then many solutions have appeared. Have they solved the big data problem? I must say: maybe, maybe not.
The basic flow for handling data that can be big data is shown below.
Three layers define the flow of the data:
Ingestion Layer: This layer collects the data from different sources. This is raw data, which can be structured or unstructured: videos, images, and so on.
Processing Layer: This layer is used to explore the data moved from the ingestion layer. The data is still raw and needs to be refined so that insight can be gleaned from it.
Speed Layer: This layer consumes the refined data to produce meaningful information in the form of reports, charts, and so on for decision-makers. Analytics uses this data to produce information for business users.
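The three layers can be pictured as a small pipeline. The following is only a minimal sketch of the flow described above, not a real big data stack: the function names `ingest`, `process`, and `report`, and the sample sources, are assumptions made up for illustration.

```python
from statistics import mean

def ingest(sources):
    """Ingestion layer: collect raw records from every source, untouched."""
    for source in sources:
        yield from source

def process(raw_records):
    """Processing layer: refine raw records, skipping ones that cannot be parsed."""
    for record in raw_records:
        try:
            yield float(record)
        except (TypeError, ValueError):
            continue  # unusable raw record, dropped in this sketch

def report(refined):
    """Speed layer (as described above): summarise refined data for decision-makers."""
    values = list(refined)
    return {"count": len(values), "mean": mean(values) if values else None}

# Two hypothetical sources that mix clean and unusable raw values.
sources = [["1.5", "2.5", "oops"], [3, None, "4"]]
summary = report(process(ingest(sources)))
print(summary)
```

Each layer only sees the output of the layer before it, which is the point of the flow: raw data in, refined data in the middle, a summary for the business user at the end.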
Available Big Data Architectures
When data requires both real-time and batch processing, it is difficult to handle, especially the real-time part.
Some architectures address this big data problem. The most used or talked about real-time big data processing architectures are Lambda, Kappa, and Zeta.
Among these, the Lambda architecture is the most talked about and the most used to overcome big data issues.
However, the Lambda architecture is problematic in practice. Keeping the results from the batch and speed layers in sync is complex, and if different frameworks are used for the batch and speed layers, the same problem has to be solved in two different ways.
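To make the sync problem concrete, here is a minimal sketch of how a Lambda-style query combines a precomputed batch view with a speed view covering only events that arrived after the last batch run. The names `merge_views`, `batch_view`, and `speed_view`, and the page counts, are hypothetical and exist only for this illustration.

```python
def merge_views(batch_view, speed_view):
    """Add recent counts from the speed view on top of the precomputed batch view."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# Counts computed by the last batch run (hypothetical numbers).
batch_view = {"page_a": 100, "page_b": 40}
# Counts for events that arrived after that batch run.
speed_view = {"page_a": 3, "page_c": 1}

result = merge_views(batch_view, speed_view)
print(result)
```

The answer is only correct while the two views cover disjoint time ranges. If the next batch run absorbs the recent events and the speed view is not reset at the same moment, those events are counted twice, which is the kind of sync issue mentioned above.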
Does the Lambda Architecture Solve the Big Data Problem?
I am still skeptical that the Lambda, Kappa, or Zeta architecture will give a one hundred percent accurate result from precomputed data. Think about the case where the real-time data changes very frequently, or where batch processing takes a long time to deliver results, since batch processing has one disadvantage: its throughput. The bigger question is how we handle a very high volume of redundant data.
As an example, think about an IoT device that reads human breathing or heartbeat. The device records breathing information every second. Most of the time each person has a breathing pattern that can be considered normal. However, when that pattern changes, it becomes important to get an instant notification or solution.
The bigger question is why we need so much data, since this creates huge sets of redundant data. If we had some mechanism to filter data at the point where it is read or captured, so that redundant data is separated from the data that needs to be processed, then most of the main job would be done. This would also free up system space that can be used for other data processing.
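A filter of this kind can be very small. The sketch below assumes breathing-rate readings arrive as `(timestamp, breaths_per_minute)` pairs and treats anything inside a fixed range as redundant. The function name `filter_readings` and the thresholds 12 and 20 are illustrative assumptions only, not clinical guidance.

```python
def filter_readings(readings, low=12, high=20):
    """Keep only readings outside the assumed normal range [low, high]."""
    return [(t, rate) for t, rate in readings if rate < low or rate > high]

# Hypothetical (timestamp, breaths per minute) readings from one device.
readings = [(0, 14), (1, 15), (2, 27), (3, 16), (4, 9)]

alerts = filter_readings(readings)
print(alerts)  # only the readings that need attention are passed on
```

Here three of the five readings never leave the device, and only the two abnormal ones are forwarded for processing.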
To solve these issues, the concept of “edge analytics” for IoT-based devices is gaining popularity. Edge analytics, also called fog computing, is designed in such a way that the data can be cleansed and the analytics can be performed at the point where the data is collected.
In this case the edge node not only transfers data to the big data platform but also has to cleanse, filter, sample, or aggregate incoming device data, reducing the amount of data sent to the center.
In most cases the edge node uses a connector that reads and decodes the machine language into meaningful data, which is then transferred to the big data platform.
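As a rough illustration of the aggregation step, the sketch below reduces every window of raw samples to a single summary before anything is sent on. The function name `aggregate`, the window size, and the heart-rate samples are assumptions chosen for the example, not part of any particular edge product.

```python
def aggregate(samples, window=5):
    """Reduce each window of raw samples to one min/max/mean summary."""
    summaries = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        summaries.append({
            "min": min(chunk),
            "max": max(chunk),
            "mean": sum(chunk) / len(chunk),
        })
    return summaries

# Ten hypothetical heart-rate samples collected at the edge node.
samples = [70, 72, 71, 69, 73, 90, 95, 92, 91, 92]

summaries = aggregate(samples)
print(summaries)  # two summaries are sent instead of ten raw values
```

The trade-off is that the center no longer sees individual samples, so the window size has to be chosen with the downstream analysis in mind.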
Does Edge Computing Really Solve All Big Data Problems?
The Internet of Things is generating countless new streams of data that require us to quantify and analyze things in new ways that were never possible before. These new streams of data also bring serious new security risks in the process.
Somehow, I feel this is incomplete. For cases like IoT devices, mobile, and so on, this does solve the problem of handling data. But what about when you would like to do sentiment analysis and need to decipher the exact meaning of a word in its context, or when you would like to extract meaningful information from unstructured data that contains log files from six systems, some corrupted data, noise, and errors? How do we know that the queries run against the data are correct, or that they give us meaningful information for making a decision? If the processing or speed layer is unable to recognize wrong data, the results generated from that data can be disastrous. So how we overcome this is both most important and interesting.
Processing Layer Enhancement
Therefore, the data processing layer and speed layer are very important, and they need to be robust and accurate enough to generate correct output. The source of incoming data could be anything, so we cannot restrict the processing layer to IoT-based data only. Rather, it should be designed so that it can accommodate any kind of incoming data and, beyond accommodating it, take care of manipulating and processing it. This is not simple, because it requires a lot of thought and various scientific theories to complete.
This data is not the same as traditional data. It is generated so fast and in such an unstructured manner that the processing layer has to do many things before it can be prepared as usable, meaningful information. Beyond processing the data, it is also important that the data is clean and secure, or, put another way, that damaged or corrupted data can be recovered into correct data. The challenge is not to transform unstructured data into structured data, but rather to manipulate it with scientific calculation so that meaningful information can be prepared from it. Otherwise it is a waste of time, money, and so on. Therefore, the role of the data scientist has become more and more important for big data.
What is Data Science
Data science is a field that extracts knowledge or meaningful insight from data. The combination of different areas such as statistics, mathematics, programming, problem solving, computer science, and capturing data in ingenious ways is key to data science.
It also involves the ability to look at things differently and make predictions, and the activity of cleansing, preparing, and aligning both unstructured and structured data. More importantly, recovering corrupted data, or extracting important information from it, is also a major role of data science.
However, in recent times this has changed. With new configurations, or what we can call an intelligent machine, it is possible to overcome the challenges we have to meet in data science. This new concept is also gaining a lot of popularity because it is now able to address most of the big data problem.
Machine Learning: Solution to Big Data
We have exhausted the way we currently analyze big data and need a new way to analyze data. Curating and storing lots of data can be a challenge, and therefore it is time to think of a new, innovative approach to solve the most talked about issue, i.e. big data.
This is where machine learning algorithms such as Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, and so on come in. These algorithms can autonomously analyze data and identify patterns, and the best part is that they can even interpret the data and produce meaningful information in the form of reports, charts, or data visualizations.
The combination of a big data processing layer and machine learning will not only address the challenges that data scientists face but also generate intelligent features quickly. The time taken to process the data and generate meaningful information is much less than a data scientist would need, with better throughput.
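To give a feel for what identifying a pattern means in the simplest case, here is a minimal sketch of linear regression fitted by ordinary least squares in plain Python. The function name `fit_line` and the sample points are assumptions for illustration. A real system would normally rely on an established machine learning library instead.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations that follow y = 2x + 1 exactly.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # predict y for an unseen x
print(slope, intercept, prediction)
```

The algorithm recovers the relationship from the data alone and can then predict a value it has never seen, which is the basic idea the more advanced algorithms build on.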
To create a great machine learning system, we need the following:
- A processing layer with strong data preparation capabilities.
- The capability to adopt new and advanced algorithms easily.
- Ease of use through unprecedented automation and an iterative process.
- The flexibility to adopt ensemble modeling.
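Ensemble modeling, the last item in the list, can be sketched as a simple majority vote over several models. In the example below the three models are stand-in rules written as lambdas and `majority_vote` is a hypothetical helper. Real ensembles would combine trained models, but the voting idea is the same.

```python
from collections import Counter

def majority_vote(models, x):
    """Return the prediction that most of the models agree on for input x."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Three stand-in models, each a simple rule (assumptions for illustration).
models = [
    lambda x: x > 10,
    lambda x: x > 20,
    lambda x: x % 2 == 0,
]

print(majority_vote(models, 15))  # one of three models votes True
print(majority_vote(models, 12))  # two of three models vote True
```

With an odd number of models and two possible answers there is never a tie, which keeps this sketch unambiguous.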
All of these things together, i.e. machine learning, mean it is possible to build strong, robust, and secure models that can analyze bigger, more complex data and deliver instant, more accurate results, even on a very large scale. And by building precise models, an organization has a better chance of identifying profitable and successful opportunities.
Thanks & Regards