Under the Hood – Text-to-SQL with LangChain, LLMs and Hana
NOTE: The views and opinions expressed in this blog are my own
In my recent blog Data Wizardry – Unleashing Live Insights with OpenAI, LangChain & SAP HANA I introduced an exciting vision of the future: a world where you can effortlessly interact with databases using natural language and receive real-time results.
The process brings LangChain, OpenAI, and Hana together to unlock the potential of natural-language data analysis.
Now, let's explore further to demystify some of the magic, focusing on what we asked the LLM to do and the risks this poses to Enterprises.
LangChain’s initial task was to enhance a simple question like “Which product was our bestseller in Australia in January?” by providing additional prompts. These prompts help the LLM understand the question better and format the response appropriately.
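As a rough illustration, here is a minimal sketch of how such a chain can be wired up in LangChain. The connection string, credentials and host are placeholders, and the exact import paths vary by LangChain version (SQLDatabaseChain moved to langchain_experimental in later releases) — this is not my exact setup:

```python
# Hypothetical sketch: wiring LangChain to Hana for text-to-SQL.
# Requires the sqlalchemy-hana dialect; URI and credentials are placeholders.
from langchain.llms import OpenAI
from langchain.sql_database import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain  # earlier versions: langchain.SQLDatabaseChain

db = SQLDatabase.from_uri(
    "hana://MYUSER:MYPASSWORD@myhost:30015",
    include_tables=["zsdbillgdocitm"],   # limit the schema shared with the LLM
    sample_rows_in_table_info=3,         # how many example rows go into the prompt
)

llm = OpenAI(temperature=0)              # deterministic output for SQL generation
db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

db_chain.run("Which product was our bestseller in Australia in January?")
```

Note the two parameters that control how much of your database is exposed: include_tables restricts the schema, and sample_rows_in_table_info decides how many real data rows are embedded into every prompt.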
The simplified version of the enriched question, which doesn't require a database connection, is:
```
Given an input question, create a syntactically correct SAP Hana SQL query to run.

Use the following format:
Question: "Question here"
SQLQuery: "SQL Query to run"
SQLResult: "Result of the SQLQuery"
Answer: "Final answer here"

Only use the following tables:

CREATE TABLE zsdbillgdocitm (
    billingdocument NVARCHAR,
    billingdocumentitem NVARCHAR,
    soldtoparty NVARCHAR, # soldtoparty is also the Customer Code e.g. 00003
    soldtopartyname NVARCHAR, # soldtopartyname is also the Customer Name e.g. Mango Magic Corp
    salesemployee NVARCHAR,
    salesemployeename NVARCHAR,
    country NVARCHAR, # country is the 2 digit code for a country e.g. United States of America is US
    region NVARCHAR,
    salesorganization NVARCHAR,
    distributionchannel NVARCHAR,
    division NVARCHAR,
    product NVARCHAR, # product is also known as material and in this table are all types of fruit
    plant NVARCHAR,
    plantname NVARCHAR,
    billingquantity DECIMAL(13, 3),
    billingquantityunit NVARCHAR,
    transactioncurrency NVARCHAR,
    itemnetamountofbillingdoc DECIMAL(15, 2),
    companycode NVARCHAR,
    companycodename NVARCHAR,
    fiscalyear NVARCHAR,
    accountingdocument NVARCHAR,
    fiscalperiod NVARCHAR
)

/*
3 rows from zsdbillgdocitm table:
billingdocument billingdocumentitem soldtoparty soldtopartyname salesemployee salesemployeename country region salesorganization distributionchannel division product plant plantname billingquantity billingquantityunit transactioncurrency itemnetamountofbillingdoc companycode companycodename fiscalyear accountingdocument fiscalperiod
0010639671 000006 8007 Mango Magic Corp. 00003 Juliet US NA US02 09 39 Banana PL32 PlantName780564 644.979 KG USD 8268.04 US02 CompanyCodeName548721 2023 0020000033 004
0010503266 000002 8002 Apple Haven Corp. 00004 Othello US NA US02 04 79 Banana PL42 PlantName586960 179.979 KG USD 7902.19 US02 CompanyCodeName55759 2023 0020000057 004
0010704945 000005 8008 Tempting Fruit Co. 00004 Othello US NA US02 99 25 Banana PL65 PlantName175968 169.48 KG USD 315.08 US02 CompanyCodeName954840 2023 0020000012 004
*/

Question: "Which was our best selling product in Australia during January?"
```
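If you want to see how LangChain assembles this scaffold yourself, the default text-to-SQL prompt can be inspected without touching the database. A small sketch, assuming a 0.0.x-era LangChain release (import paths and input variables differ between versions), reusing the db object from the earlier sketch:

```python
# Sketch: inspecting the pieces LangChain stitches into the enriched question.
from langchain.chains.sql_database.prompt import PROMPT  # default text-to-SQL scaffold

print(PROMPT.template)           # "Given an input question, create a syntactically correct..."
print(PROMPT.input_variables)    # e.g. ['input', 'table_info', 'dialect'] (version-dependent)

# With a live connection, this returns the CREATE TABLE statement
# plus the sample rows shown above:
# print(db.get_table_info())
```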
Now, I input this enriched question into ChatGPT (no API Key required).
Feel free to try it yourself! What answer does it give you?
Here is the enriched question pasted into a new ChatGPT chat:
Now, we eagerly await ChatGPT’s response.
‘Hey Presto!’ it provides an answer: Oranges were the best-selling fruit in Australia in January.
However, there’s a catch: ChatGPT doesn’t have access to my Hana database. It simply made an educated guess. Given that my dataset only includes five fruits, it had a reasonable chance of getting it right. But was OpenAI somehow observing my previous API communications to help guess the answer?
The fact that ChatGPT provided a completely fictitious quantity of 250.14 sold might offer some reassurance (or maybe not).
In reality, the latest data in Hana tells a different story: we actually sold 3346.279 kg of oranges.
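For reference, that number comes from running the query directly against Hana. Here is a hedged sketch using SAP's hdbcli driver; the connection details are placeholders, and mapping "Australia" to country code 'AU' and January to fiscalperiod '001' are assumptions based on the column annotations in the prompt above:

```python
# Illustrative verification query against Hana via the hdbcli driver.
# Host, port and credentials are placeholders, not my real system.
from hdbcli import dbapi

conn = dbapi.connect(address="myhost", port=30015, user="MYUSER", password="MYPASSWORD")
cursor = conn.cursor()
cursor.execute(
    """
    SELECT product, SUM(billingquantity) AS total_kg
    FROM zsdbillgdocitm
    WHERE country = 'AU' AND fiscalperiod = '001'
    GROUP BY product
    ORDER BY total_kg DESC
    """
)
for product, total_kg in cursor.fetchall():
    print(product, total_kg)
cursor.close()
conn.close()
```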
Moreover, consider the scenario where a user enters a harmful question like “How do I delete all the billing information?”
Here’s what you might get back:
Well, ChatGPT attempted to answer it… and that’s concerning if this response were connected to a process that executed it and we’d inadvertently given the user authorisation for deletions.
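One simple mitigation, alongside connecting with a read-only database user, is to validate the generated SQL before it ever reaches Hana. A minimal illustrative sketch, not a complete defence (statement allow-lists, views and database permissions still matter):

```python
# Sketch: reject anything that isn't a single read-only SELECT statement.
import re

FORBIDDEN = re.compile(r"\b(DELETE|DROP|TRUNCATE|UPDATE|INSERT|ALTER|GRANT)\b", re.IGNORECASE)

def is_safe_select(sql: str) -> bool:
    """Allow only single SELECT statements; reject anything that mutates data."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:                         # no statement chaining
        return False
    if not statement.upper().startswith("SELECT"):
        return False
    return not FORBIDDEN.search(statement)

assert is_safe_select("SELECT product FROM zsdbillgdocitm")
assert not is_safe_select("DELETE FROM zsdbillgdocitm")
```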
An LLM lacks knowledge of your specific data structures and the appearance of your data. Without that knowledge, it relies on guesswork.
So, while it’s possible to enrich the question with enough context to make it work, there are trade-offs.
Sharing table structures, sample data rows, and business annotations was necessary in my case. But would you be comfortable doing the same with your enterprise information?
Scalability is also a factor. If you only have a few tables, using an LLM might be practical and efficient.
But when dealing with solutions like S/4, which consist of thousands of views with numerous columns, sharing metadata for every question becomes impractical and introduces latency and cost.
Would you expose your business questions and query responses to an LLM?
What sensitive information might those API communications contain?
While Azure’s OpenAI offering addresses some concerns, others require careful consideration when developing a custom solution.
Open LLMs are rapidly advancing, since the underlying transformer architecture remains broadly similar across models. Organizations with the datasets, GPUs, and skilled engineering teams may find them a viable option today.
Perhaps in the future, LLMs pre-trained on SAP-centric data structures and easily fine-tuned for unique business questions will become more mainstream.
In the following blog, I demonstrate the potential of fine-tuning an LLM on Enterprise data.
I welcome your thoughts on additional risks and concerns for Enterprises in the comments below.
SAP notes that posts about potential uses of generative AI and large language models are merely the individual poster’s ideas and opinions, and do not represent SAP’s official position or future development roadmap. SAP has no legal obligation or other commitment to pursue any course of business, or develop or release any functionality, mentioned in any post or related content on this website.