Cloud Data Warehouse – Some thoughts
For some time now, a nagging question has been on my mind, and my simple answer to it keeps changing every few weeks.
Before getting to the main question, imagine this.
What if you were living two hundred years ago and were asked whether you wanted to fly from Germany to India? The obvious response would have been “Du redest Unsinn” (“You are talking nonsense”). Flying was not yet even a concept.
Now go back thirty years. If you had been asked whether you were comfortable storing your banking information on some computers somewhere on the internet instead of at your own bank, the response would have been similar.
Even ten years ago, if the same question had been put not to an individual but to a financial institution, about storing its financial information in the cloud, the response would probably have been, “We are happy keeping it in our physically secured, double-gated, basement datacenter in an undisclosed location.”
As you might have experienced yourself, things do not stay the same for long. Now, you may feel more confident handing over your personal information to be stored in the cloud than keeping it in a vault inside your own home.
During the late 90s, Scott McNealy’s now-defunct Sun Microsystems carried a tagline alongside the company’s logo: “The Network is the Computer.” It was a little too early for its time. People were not happy running a “thin client” at the network speeds of the day, and the idea faded quickly, taking the company along. But it is interesting to think of the possibilities now that still higher network speed and efficiency are going to be available with 5G technology in the next few years.
Now, coming back to my initial thought: why are we still thinking of building data warehouses meant for on-premise deployment (even with the ability to handle data from many sources), instead of building them as complete cloud-based solutions?
There is no single answer for me at this time. The obvious and immediate answer that comes to mind is that many customers want to keep using what they already have, such as existing systems and skilled developers and administrators, and/or are not completely ready for a full-blown cloud-based enterprise data warehouse hosted outside, possibly because of the continuously increasing volume of the data warehouse.
Fig: Initial Data Warehouse design
Image courtesy (IBM Systems Journal (1988))
Fifteen or twenty years ago, many Unix hardware vendors were shifting to a utility-based billing model for their customers’ hardware systems, even though the systems themselves ran inside the customers’ double-gated data centers. That was a major shift and not an easy transition for many customers at that time, as they were used to owning the enterprise’s computing systems one hundred percent, without realizing the cost of unutilized system resources.
As network adoption grew, data volume grew faster than customers’ hardware spending could keep up with their need for speed. Customers needed big computing power on demand at certain times and at regular intervals, such as financial closing periods in their business operations cycle. The hardware vendors saw this as an opportunity and decided to sell hardware with more computing power at a lower price, with the customers paying based on system CPU usage under the utility billing model. This gave many customers breathing room, and they got used to the idea of paying for what they use.
Then the data grew, mainly because of enterprise business growth and the volume of transactions, as well as the growth in automation of B2B transactions. So it was not a bad option for the customer to have computing power turned on and off as it was needed.
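To make the pay-for-what-you-use idea concrete, here is a minimal sketch that compares owning peak capacity outright with a usage-based bill. Every number and function name in it is hypothetical and chosen only for illustration; no real vendor's pricing is implied.

```python
# Hypothetical comparison: owning peak capacity vs. utility (usage-based) billing.
# All rates, fees, and CPU counts below are made-up illustration values.
HOURS_PER_DAY = 24


def cpu_hours(schedule):
    """Total CPU-hours for a list of (cpus_in_use, days) periods."""
    return sum(cpus * days * HOURS_PER_DAY for cpus, days in schedule)


def owned_monthly_cost(peak_cpus, cost_per_cpu_month):
    """Owning the system: you pay for peak capacity whether it is used or not."""
    return peak_cpus * cost_per_cpu_month


def utility_monthly_cost(schedule, rate_per_cpu_hour, base_fee=0.0):
    """Utility billing: a fixed base fee plus a charge per CPU-hour actually used."""
    return base_fee + cpu_hours(schedule) * rate_per_cpu_hour


def utilization(schedule, peak_cpus):
    """Fraction of the peak capacity that was actually used over the period."""
    total_days = sum(days for _, days in schedule)
    return cpu_hours(schedule) / (peak_cpus * total_days * HOURS_PER_DAY)


# A 30-day month: 16 CPUs on normal days, 64 CPUs during a 3-day financial close.
month = [(16, 27), (64, 3)]

owned = owned_monthly_cost(peak_cpus=64, cost_per_cpu_month=100.0)
utility = utility_monthly_cost(month, rate_per_cpu_hour=0.25, base_fee=1000.0)

print(f"owned: {owned:.2f}  utility: {utility:.2f}  "
      f"utilization: {utilization(month, 64):.1%}")
```

With these assumed figures the owned system sits idle most of the month, which is the hidden cost of unutilized resources described above. With a higher per-hour rate or steadier usage, the comparison could easily go the other way.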
Now we are in the age of B2C. With this, many companies generate or depend on more data from outside the enterprise than was typically generated within it, including the data exchanged between business partners. Beyond this, to get a 360-degree view of the business, enterprises want to understand and analyze everything possible to get a better return on their investment. They want to understand what the market or the customer thinks about them, drawing on competitor information from market sources, customers’ postings on social media, their browsing patterns on the internet, and spending trends from their internet transactions, and to put all this information together to predict the next strategy for their enterprise. So the natural evolution is a need for more power to handle a huge volume and variety of data and arrive at better results.
With this huge volume, variety, and velocity (the three V’s of big data, covering structured and unstructured data), the technology to integrate such data in an on-premise enterprise data warehouse is becoming available, but it is still bound to be time-consuming to implement compared to solutions that can be built quickly in a cloud environment.
Fig: Cloud in the future
Image courtesy (cio.com)
With all these requirements, it is not only CPU power that is going to be a variable, as in the utility-based billing of earlier days, but the total computing capacity, including CPU, memory, and even data storage, since you do not want to retain all the data that has already served its value. Solutions like data tiering or near-line storage can help with this (including for on-premise applications). With a cloud data warehouse, all of these can easily be ordered on demand and released when you are done with your analysis. Data that no longer has value can be moved from the high-cost storage area to a lower-cost one, reducing the operational cost.
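The tiering idea can be sketched in a few lines. This is only an illustration under stated assumptions: the 90-day threshold, the per-GB prices (in cents), and the dataset names are all hypothetical, and real tiering products use richer policies than last-access age.

```python
from datetime import date

# Hypothetical monthly storage prices in cents per GB (integers avoid rounding noise).
HOT_CENTS_PER_GB = 10
COLD_CENTS_PER_GB = 1


def choose_tier(last_accessed, today, max_hot_age_days=90):
    """Keep recently used data on the hot tier; move everything else to cold."""
    age_days = (today - last_accessed).days
    return "hot" if age_days <= max_hot_age_days else "cold"


def monthly_storage_cents(datasets, today, tiered=True):
    """Monthly cost of a list of (name, size_gb, last_accessed) datasets.

    With tiered=False every dataset is billed at the hot-tier price.
    """
    total = 0
    for _name, size_gb, last_accessed in datasets:
        tier = choose_tier(last_accessed, today) if tiered else "hot"
        rate = HOT_CENTS_PER_GB if tier == "hot" else COLD_CENTS_PER_GB
        total += size_gb * rate
    return total


today = date(2024, 1, 1)
datasets = [
    ("sales_2023", 500, date(2023, 12, 15)),       # used two weeks ago
    ("clickstream_2021", 4000, date(2022, 6, 1)),  # untouched for over a year
]

all_hot = monthly_storage_cents(datasets, today, tiered=False)
tiered = monthly_storage_cents(datasets, today, tiered=True)
print(f"all hot: {all_hot} cents, tiered: {tiered} cents")
```

In this made-up case most of the volume is old clickstream data, so moving it to the cheaper tier removes most of the storage bill while the recent sales data stays on fast storage.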
As mentioned, the data volume grew further with B2C, where more of an enterprise’s data is generated on the internet than within the enterprise. That being so, why bring all that big data into the enterprise to analyze it, instead of going to where the data is, which in most cases is the cloud? This may not be the right solution for huge enterprises that generate more data internally, but my assumption is that if you want the complete view, you probably need more data from outside than what is generated inside. It is also my estimate that bringing the big data inside the enterprise will take more time and cost more than keeping the data where it is, which also gives much quicker insight.
Now think of the availability of the data warehouse systems and the analytic solutions on top of them. With enterprise data warehouse solutions, you probably need High Availability (HA) and/or Disaster Recovery (DR), and the costs are going to nearly double or quadruple depending on what you need. The ability of a cloud data warehouse to separate the layers makes it much less expensive and more attractive. With the layers separated into data storage, computing systems, and analytic systems, if one or more of these breaks, it can be replaced or taken over at relatively low cost by the quickly configurable resources available in the cloud, assuming the cloud solution offers that capability and the systems can be made available in more than one geographical location for disaster recovery.
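A rough way to see why separating the layers matters for cost is the sketch below. It rests on assumptions of my own: that HA and DR each duplicate the full on-premise stack (one reading of "doubled or quadrupled"), that in the cloud only storage is fully replicated per region, and that compute and analytics in a standby region cost a small percentage until needed. All cost figures are hypothetical units.

```python
def on_prem_cost(base_cost, ha=False, dr=False):
    """Full-stack duplication (assumption): HA doubles the stack, and DR
    doubles it again at a second site, so HA plus DR gives four copies."""
    copies = (2 if ha else 1) * (2 if dr else 1)
    return base_cost * copies


def layered_cloud_cost(storage, compute, analytics, regions=1, standby_percent=0):
    """Separated layers (assumption): storage is replicated to every region,
    while compute and analytics in each extra region are billed at only
    standby_percent of their normal cost until a failover happens."""
    extra_regions = regions - 1
    storage_cost = storage * regions
    processing = compute + analytics
    processing_cost = processing * (100 + extra_regions * standby_percent) / 100
    return storage_cost + processing_cost


# Hypothetical cost units per month for a single, unprotected deployment.
storage, compute, analytics = 20, 50, 30
base = storage + compute + analytics

print("on-premise, HA only:   ", on_prem_cost(base, ha=True))
print("on-premise, HA and DR: ", on_prem_cost(base, ha=True, dr=True))
print("cloud, 2 regions:      ",
      layered_cloud_cost(storage, compute, analytics, regions=2, standby_percent=10))
```

The point of the sketch is the shape of the formulas, not the numbers: duplication multiplies the whole stack, while a layered design multiplies only the cheap, always-on part. The standby percentage in particular depends entirely on what a given cloud offering actually supports.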
Data security is another important requirement to consider. Cloud solutions are more secure now, backed by the expertise providers have built over many years of service, so security might no longer stop many enterprises, especially as the cloud also reduces the cost of securing the data with in-house administrators.
Now imagine B2C integrating further with B2B and C2C, along with other soon-to-be-realized technologies including the 5G network. In just a few years from now, we can expect to see many enterprises depend completely on the cloud for everything, including data warehouses, instead of anything on-premise. For that matter, we might even find the old data centers turned into real warehouses that store, distribute, and manage farmers’ produce based on the data in the cloud data warehouse.