First, allow me to apologize in advance. I am writing this article as I believe it may provide some insight into how and why you may want to use Apache Drill to act as one part of your connectivity suite from HCP to Hadoop. In doing so, I will be calling on our own implementation and our own connectors. So the apology is: If you see this as an advert, sorry. It is not meant to be, but as I am using our experience, it may come across as such.
Background to Apache Drill and why it is our SQL engine of choice for querying Hadoop
Apache Drill is an ANSI SQL compliant query engine that lets users run schema-free, on-the-fly queries against their Hadoop instance, with no need to define a schema before querying the data. This is important to the work that we do at SearchYourCloud, as we are, in its purest form, an enterprise search company. As such, we do not pre-define queries, and we are federated, meaning that we search not just a single monolithic database, but everywhere that your information may reside.
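To make the idea of a schema-free, on-the-fly query a little more concrete, here is a minimal Python sketch that packages an ad-hoc SQL statement for Drill's REST endpoint (`/query.json`). It is not our connector code. The host and port (`localhost:8047`), the file path `/data/orders.json`, the column names, and the helper names `build_query_request` and `run_query` are all assumptions made for illustration.

```python
import json
import urllib.request

# Assumption: a Drill instance reachable locally on its web port.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical ad-hoc query: the JSON file is queried in place,
# without declaring a table or schema first.
AD_HOC_SQL = (
    "SELECT t.customer, t.amount "
    "FROM dfs.`/data/orders.json` AS t "
    "WHERE t.amount > 100"
)


def build_query_request(sql, url=DRILL_URL):
    """Wrap a SQL string in the JSON body that Drill's REST API expects."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the result rows (needs a running Drill)."""
    with urllib.request.urlopen(build_query_request(sql, url)) as response:
        return json.load(response).get("rows", [])
```

With a Drill instance running, `run_query(AD_HOC_SQL)` would return the matching rows as a list of dictionaries. The point is that the SQL is just a string supplied at query time, which is what makes this approach suit ad-hoc search.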
Is Drill better than Spark for connectivity? No, it is different. Spark requires you to incorporate your SQL statements inside a Spark application. This has merits but not when you are doing ad-hoc searches for reporting, BI or analytics.
Is Drill better than other SQL-on-Hadoop technologies such as Hive? Again, no, not in our view. Hive is fantastic at batch processing, while Drill is great at querying.
What has to be remembered when choosing a technology for a particular task or job is to know not just the outcome, but also how the task will be implemented. In our case the implementation is pretty straightforward, as hopefully can be seen in Diagram 1.
Why do you need a Query Connector?
There are many reasons why you might not require one. You may be happy to write a script each time you want to run a query, or you may not actually need to query your Hadoop instance and your HANA instance alongside a standard RDBMS. But in our case, we do. As I have said, we search in real time and across many different stores. This means that we need tools that are ready and waiting just to have an SQL query sent to them. This for us is the most compelling reason to have a Query Connector. It means that our customers have the flexibility to store information in multiple systems, yet still have access to the data as if it were from a single source. The diagram below outlines this better.
Our need to enable customers to create ad-hoc queries across an entire data-set, whether it resides solely inside a single DB or, as is the case in most instances, across multiple stores, drove us to create an SQL Query Connector. Our choice to use Apache Drill was a simple one, as we required:
- Ad-Hoc SQL Statements
- Non-Batch Processes
- Queries not just across Hadoop, but also Azure
- The ability to federate those with other ODBC and JDBC sources
This meant that our path was limited to creating our own query engine or wrapping an existing proven one into our own libraries. The second option seemed so much more elegant, even if it were not as much fun!
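As a rough illustration of what federating sources in one statement can look like, the sketch below composes a single SQL join across two Drill storage plugins. It only builds the SQL string and sends nothing. The plugin names (`hdfs`, `crm`), workspaces, table names, and the helpers `drill_table` and `federated_join` are hypothetical, and the real names depend entirely on how the storage plugins are configured in your own Drill instance.

```python
def drill_table(plugin, workspace, name):
    """Build a plugin.workspace.`name` reference, backtick-quoting the last part."""
    return f"{plugin}.{workspace}.`{name}`"


def federated_join(left, right, key, columns):
    """Compose one SELECT that joins two sources on a shared key column."""
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list} "
        f"FROM {left} AS l "
        f"JOIN {right} AS r ON l.{key} = r.{key}"
    )


# Hypothetical sources: a JSON file on Hadoop and a table in a JDBC-backed store.
events = drill_table("hdfs", "logs", "events.json")
customers = drill_table("crm", "sales", "customers")

sql = federated_join(events, customers, "customer_id", ["l.event_type", "r.name"])
print(sql)
```

The caller sees one query and one result set, while the engine works out which store each table lives in. That is the "single source" behaviour we wanted for our customers.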
My last words of advice are these: choose your technology based on your tasks and your required outcomes, not just on your own preferences (mine would have been to write our own).