Into the SQL weeds – Fine-Tuning an LLM with Enterprise Data
NOTE: The views and opinions expressed in this blog are my own
This blog provides a simple overview of the process of fine-tuning a Large Language Model (LLM) with Enterprise data to help it produce tailored HANA SQL statements.
It’s a continuation of my previous two blogs:
Data Wizardry – Unleashing Live Insights with OpenAI, LangChain & SAP HANA – demonstrates how questions on live Enterprise data can be asked in natural language, translated to SQL and executed
Under the Hood – Text to SQL with LangChain, LLM’s and Hana – provides some further details of the underlying architecture and highlights some of the risks and challenges
Once again I’m trying to ask billing questions on my custom table ZSDBILLGDOCITM, which is a simplified representation of the S/4 CDS View I_BillingDocumentItemCube and which I’ve populated with dummy data.
In the earlier blogs I already explored asking questions using OpenAI and ChatGPT, and it does a very good job if you provide a lot of enterprise info along with the questions.
The Azure OpenAI offering provides the opportunity to fine-tune the language model for your specific needs.
In recent months there has been an explosion of open-source language models available, so rather than demonstrate fine-tuning with Azure I thought it would be more interesting to show how a free model would perform before and after some training.
Obviously it’s not completely free; you’ll still need to run it on a machine with a decent GPU (e.g. an NVIDIA A100). Time is money 😉
After a bit of searching on Hugging Face I opted to test the StarCoder model; it’s a tiny fraction of GPT-3’s size but is specialised in coding, so the hope was that it has a reasonable command of SQL.
Disclaimer: Always check a model’s commercial usage guidelines, as restrictions may vary.
Before Fine-Tuning
So let’s give it a try with the same question from the earlier blog:
“Which was our best selling product in Australia during January?”
Not a bad attempt, but remember it doesn’t know anything about the table, the fields or the dataset.
There’s no chance it will run in HANA, so it’s not even worth a try.
Once again I don’t want to pass the table structures etc. along with the question, so instead I want to train the LLM on the enterprise data beforehand.
Data Preparation
The most time-consuming part of the activity is preparing some training files.
For example, I need to create a list of questions and answers in the following JSON form:
[{
"prompt": "What is the total quantity and amount sold for each division in the billing dataset for the North America region?",
"completion": "SELECT division, SUM(billingquantity), SUM(itemnetamountofbillingdoc) FROM zsdbillgdocitm WHERE region = 'NA' GROUP BY division limit 10;"
}]
NOTE: the exact form and structure of your training data can vary with custom LLMs, since you’re writing the logic, but to keep it simple I opted to use a format similar to that used by Azure OpenAI.
If you want to see the small training set I created you can find it here.
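As a quick sanity check, the training file can be loaded and inspected with the Hugging Face datasets library. Below is a minimal sketch (the file name train_sql.json is just a placeholder for wherever you store the prompt/completion pairs):

# Minimal sketch: load and inspect the prompt/completion training pairs.
# The file name is a placeholder for your own training file.
from datasets import load_dataset

# The JSON file holds a list of {"prompt": ..., "completion": ...} objects
dataset = load_dataset("json", data_files="train_sql.json", split="train")

print(dataset)                   # row count and column names
print(dataset[0]["prompt"])      # first question
print(dataset[0]["completion"])  # its target SQL statement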
Train and Run
Next I loaded the dataset, tweaked the format, tokenized the data and then trained the model on the new dataset with the necessary transformers libraries in Python.
For a broad overview of the steps, see the Hugging Face docs.
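In rough outline, the flow looks like the sketch below – a simplified illustration rather than my exact script; the checkpoint name, file name and hyperparameters are placeholders, assuming the Hugging Face transformers and datasets libraries:

# Minimal fine-tuning sketch with Hugging Face transformers.
# Checkpoint, file name and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

checkpoint = "bigcode/starcoderbase"        # illustrative StarCoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token   # ensure a pad token is set
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Join each prompt/completion pair into a single training text and tokenize it
dataset = load_dataset("json", data_files="train_sql.json", split="train")

def tokenize(example):
    text = example["prompt"] + "\n" + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="starcoder-hana-sql",
        num_train_epochs=3,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # Causal LM collator: pads the batch and copies input_ids into labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

In practice, as noted in the comments below, I also quantized the model and applied LoRA to keep GPU memory usage down.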
Training on an A100 with this tiny dataset of 100 examples took under 10 minutes.
So let’s ask the question again:
“Which was our best selling product in Australia during January?”
Cool.. It produced valid SQL:
SELECT product, SUM(itemnetamountofbillingdoc) AS total_sold_in_january FROM zsdbillgdocitm WHERE fiscalperiod = '001' AND country = 'AU' GROUP BY product limit 10;
Not optimal, but it runs, and if the response were plugged into the LangChain logic from my earlier blog it would likely produce the expected answer.
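For completeness, asking the question is just a standard generate call against the fine-tuned model. Here is a minimal sketch, continuing with the tokenizer and model objects from the training sketch above (the decoding settings are illustrative):

# Continues from the training sketch above (reuses tokenizer and model);
# decoding settings are illustrative.
question = "Which was our best selling product in Australia during January?"

inputs = tokenizer(question, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Drop the echoed prompt so only the generated SQL remains
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated[len(question):].strip())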
Simple right?
Well, actually it wasn’t so straightforward. Even after many iterations of training and testing, the SQL being generated wasn’t quite what I wanted, so rather than create more training data I tweaked the prompts just a little to give some hints.
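For example, a hinted prompt could look something like this (the wording here is hypothetical rather than my exact hints; the table name, the 'AU' country code and fiscal period '001' come from the examples above):

“Which was our best selling product in Australia during January? Hint: query table zsdbillgdocitm, where country is a 2-character code (Australia = 'AU') and January is fiscal period '001'.”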
The hints helped a lot, since the LLM knew nothing of the underlying data.
It’s not as fast or as accurate as OpenAI, but at the same time it’s potentially more flexible depending on the use case.
The more training data you can supply the better it can be. 🙂
I hope you found this informative.
What Enterprise data would you want to train it on?
What concerns do you have about data security?
Could this be a viable option?
SAP notes that posts about potential uses of generative AI and large language models are merely the individual poster’s ideas and opinions, and do not represent SAP’s official position or future development roadmap. SAP has no legal obligation or other commitment to pursue any course of business, or develop or release any functionality, mentioned in any post or related content on this website.
Hello Sir. This blog is quite insightful and inspiring. Thanks for sharing your perspective. I was looking for help with a similar use case. Can you please open source the code for fine-tuning StarCoder?
Thanks for your interest. I'm planning to provide a tidied-up / simplified version of the fine-tuning process in a subsequent blog covering some more complex SQL scenarios (e.g. spanning multiple tables, requiring joins). In the meantime, though, for StarCoder I tweaked a few things to keep memory usage down that will likely have impacted the fine-tuning too (e.g. quantized the model to 4-bit and applied LoRA on some of StarCoder's attention weights); if I'd had more resources available I'd have skipped some steps to compare results. Otherwise the fine-tuning step was mostly straightforward, as follows, though I make no claim to having optimised the training set size or parameters for effectiveness:
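A rough sketch of that setup, assuming the transformers, peft and bitsandbytes libraries (the checkpoint name and LoRA settings below are illustrative, not my exact values):

# Rough sketch of 4-bit quantization + LoRA for StarCoder;
# checkpoint and LoRA settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

checkpoint = "bigcode/starcoderbase"   # illustrative checkpoint

# Load the model in 4-bit to keep GPU memory usage down
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters to (some of) the attention weights instead of
# training the full model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],   # StarCoder's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here the tokenization and Trainer-based training loop are the same
# as in the blog's training sketch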
Thanks for replying. It is helpful. Can you also share the data structure in which you prepared the training data? I recently got good results with a fine-tuned Falcon 7B model with some prompt tuning and engineering, but there is still a long way to go. I am also planning to fine-tune my own custom model with Falcon/StarCoder to get a more optimal result.
Hi. Here's a Google Colab example which may give you inspiration.
Hana_LLM_Experiments
Thank you for sharing, Aron. I will test your code in my use case.
Very insightful, Aron MacDonald sir. Waiting for your new blog on more complex SQL handling. Also, instead of querying over a single table, could you do the same over a database, similar to SQLDatabaseChain (LangChain)? Waiting for your thoughts on this, and please share the complete code, it will be very helpful. Thank you!
Thanks for the comments. Here's a Google Colab where you can test it yourself: Hana_LLM_Experiments
Thank you so much sir
Hi Aron -
The links to the above blog posts are not working.
Regards
Srini
Thanks for letting me know. 🙂
It should be fixed now.
Hi Aron, this is very promising and useful work. By the way, I wonder if SAP has any internal API for employees to fine-tune an LLM with external knowledge bases? Thanks.