Generative AI for SAP using AWS, Part II: Model Customization (with RAG)
This is a series of blogs I am writing for SAP Tech leaders who want to understand how to use Generative AI from an SAP and Enterprise perspective using the AWS platform.
I know this is an immature technology, and this series of blogs will age badly, but I decided to try it anyway, to share my personal impressions and the best practices I see while working through personal and professional POCs.
In my first blog, I discussed which foundation models are available and how they can be deployed in a custom VPC with a custom domain.
In this second blog, I address a fundamental question: how can we customize models, with a focus on data privacy? The way I see it, AI has been around for many years; what has changed in recent months, and what has caused the market disruption, is that it has given, allow me to use this phrase, "power to the people", or more technically, the creation of pre-trained foundation models ready to be consumed, like GPT from OpenAI. There is massive information on the web about this right now, Bard here, LLaMA there, but we operate at the enterprise level, and for the enterprise the rules are very different than for us as individuals. As individuals, we can ask ChatGPT a generic question about our next visit to Rome; as an enterprise, the data must be relevant, secured, and up to date. Data from public datasets may not be appropriate, and of course we need to feed our models with the most up-to-date information.
OpenAI is fantastic, but in the enterprise we see data vendors like Snowflake, Databricks, or DataRobot (which integrates well with SAP Datasphere) carving out policies that give customers an alternative to the OpenAI path: letting them quickly develop machine-learning models from scratch, or use open-source models, rather than depending on proprietary ones like OpenAI's GPT-4. For the end user this distinction is not relevant, but for the enterprise it matters significantly.
This wide-scale adoption of ML Models makes the concerns and challenges around privacy and data security paramount and ones that each organization needs to address. In this blog post, we will explore some potential approaches organizations can take to ensure robust data privacy while harnessing the power of these Models and feeding these models with SAP data.
Models like LLMs are typically trained on vast amounts of data to develop a statistical understanding of human language patterns. The extensive datasets used in training can often contain sensitive information, leading to privacy concerns. Additionally, the traditional approach of relying on cloud-based services to deploy and conduct inference with these models requires organizations to transfer their data to third-party centralized model providers, which can lead to data exposure, data breaches, and unauthorized access. This is why large corporations have limited employees’ use of LLMs with enterprise data.
Instead of throwing a plethora of technical jargon into this blog, I will describe the art of the possible for SAP customers interested in model customization: why it should happen and how it can be done in the easiest way possible, using the AWS platform.
I use the AWS platform as an example for several reasons:
- I am totally biased in favor of the platform, because 99% of the customers I serve in my business unit have their data on AWS, and that is where my interest and expertise lie. Still, most of what I describe is also possible on other platforms.
- I see AWS's approach as exciting: the company sets itself a clear platform strategy, compared to Microsoft or Google, which are positioning their platforms toward their own models or specific third-party models (like OpenAI), probably leaving less space for an open platform.
- I particularly like Amazon SageMaker and its capabilities; it's a complex but mature tool. SageMaker has been with us for years for AI, but AWS is about to release a specific service for Generative AI, Bedrock, which is still in preview as of August 2023. Bedrock will dramatically simplify and enhance the Generative AI capabilities of AWS, but until its GA, we can achieve much on the AWS platform with current software.
- Bedrock lets developers select from a range of models from co:here, Anthropic, AI21 Labs, and Stability AI. Using these models, the AWS platform allows an easy build on top of existing Generative AI products, easing the development of applications like chatbots that run on AWS infrastructure, without worrying about the underlying infrastructure.
We are still unsure how much investment is going into enterprise models, but at least there is a lot of focus. Take co:here as an example: they plan to integrate many Generative AI capabilities into Oracle Fusion Cloud, NetSuite, and Oracle's industry-specific apps, and they plan to do the same with SAP. So the Holy Grail from the previous blog, private foundation models versus public foundation models, private datasets versus public datasets, continues through this blog: the co:here foundation model that software vendors Oracle and SAP plan to use enables companies to train on smaller datasets and generate higher-confidence results, so customers and users get the key benefits of all that enterprise data inside their apps, with the privacy and confidence we need to provide to our customers.
So, we see the list of initiatives that prominent software vendors are pursuing, well summarized in this blog post. Customers can choose the path (which has been there for years) of becoming their own data scientists: build their own training datasets, create algorithms and models, hire good data scientists, buy specialized AI hardware (which is not readily available in the marketplace right now)… Or they can take the new world of possibilities, where we work with a trusted software vendor like SAP, with its deep engineering team and data-science expertise, which will build large sets of these AI capabilities right into the applications for us to consume; then maybe we do a little modification and customization to train for specialty needs. But that will happen in the future, and until SAP releases more Business AI capabilities, generative or not, the ability to deploy and customize these models remains in our own hands.
The problem with foundation models is that they come pre-trained on generic data; some models only know things up to about 2021. They do give answers, but very generic ones. So if we wonder what it takes to customize a GPT by giving it our documentation as reference material, we will notice that:
☝️ Data is limited to a cutoff date, and it is generalistic
✌️ Data must be fed constantly, not in batches
🤟 Context size is limited in tokens, which in the end means words
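To get a feel for that last point, a common rule of thumb says English text averages roughly four characters per token. The sketch below uses that heuristic only as a sanity check; a real tokenizer (BPE or similar) is needed for exact counts, and the sample sentence is invented:

```python
# Rough token-count estimate using the common "about 4 characters per
# token" rule of thumb for English text. Real tokenizers (BPE, etc.)
# give exact counts; this is only a quick sanity check before sending
# a prompt to a model with a limited context window.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

doc = "Post the vendor invoice in company code 1000 before month-end close."
print(estimate_tokens(doc))
```

This kind of estimate helps decide how many documents fit into a prompt before the context window overflows.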
🤔 The question is: how can we feed the models with our own data?
RAG and Embeddings
We can use Retrieval Augmented Generation (RAG) to retrieve data from outside a foundation model and augment our prompts by adding the relevant retrieved data in context. With RAG, the external data used to augment our prompts can come from multiple data sources, such as document repositories, databases, or APIs.
The first step is to convert our documents into a format compatible with relevancy search: the document collection (our knowledge library) and the user-submitted queries are converted into numerical representations using embedding language models, making the formats compatible.
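The whole RAG retrieval step can be sketched in a few lines of plain Python. The "embedding" below is a toy bag-of-words vector standing in for a real embedding model (such as co:here Embed), and the SAP-flavored document names and texts are invented for illustration:

```python
# Minimal sketch of the RAG retrieval step. A toy bag-of-words vector
# stands in for a real embedding model; in production, the vectors
# would come from an embedding endpoint and live in a vector database.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy embedding: a sparse word-count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A tiny "knowledge library" (invented SAP-flavored snippets).
documents = {
    "invoice_doc": "posting an invoice requires a vendor and a company code",
    "hr_doc": "employee onboarding is managed in the HR module",
}

query = "how do I post a vendor invoice"
query_vec = embed(query)

# Relevancy search: pick the document whose vector is closest to the query.
best_doc = max(documents, key=lambda d: cosine(query_vec, embed(documents[d])))

# Augment the prompt with the retrieved context before calling the LLM.
prompt = f"Context: {documents[best_doc]}\n\nQuestion: {query}"
print(best_doc)  # → invoice_doc
```

The augmented `prompt` is then what we actually send to the foundation model, so the answer is grounded in our own data.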
In machine learning (ML), embeddings represent objects, such as words, users, or products in a recommendation system, as vectors in a numerical space. These vectors capture the relationships and semantic meaning between objects, allowing algorithms to work with them more effectively. Embeddings often convert categorical or discrete data into a continuous form that machine-learning models can process. Common kinds include:
1️⃣ Word Embeddings
2️⃣ Image Embeddings
3️⃣ Entity Embeddings… etc
Embeddings are typically provided by the creators of foundation language models. These embeddings are crucial to these models and play a significant role in their performance and capabilities.
Embeddings allow us to feed our models with enterprise data so we can achieve relevant outcomes. These are the benefits of feeding a model with embeddings:
✅ Fast and cheap
✅ Easy to update: we can update only the most recent docs or unstructured data
✅ We can limit the answer to the data we need
✅ Since we are feeding a model, the answer is decoupled from the data, so if the model is Multilingual, the answer can come in any language
✅ We can leverage, in a new way, all the analytical data we have already created
Where and how can we find Embeddings Libraries? ⬇️
Hugging Face 🤗
Customize a foundation model with Fine-Tuning
Customizing a model is the easy path if we want to skip training. Model training will be inevitable for many companies, and there is no doubt that training is the most effective way to get the most relevant results, but it is costly. The power of foundation models comes partly from the fact that they are pre-trained, allowing us to interact with them through new and innovative tools and to postpone training until we are confident we need it.
I won't discuss prompt tuning today and will skip directly to fine-tuning; I find fine-tuning the more attractive topic for SAP customers to discuss, because models must be fed with data.
Fine-tuning: The art of updating Model parameters
Fine-tuning means further specializing an already pre-trained model on a narrower data distribution by updating the model's parameters.
This blog won’t dive too deep into the technical side of fine-tuning, but at a high level, we can think of fine-tuning along two dimensions: what kind of dataset you’re using and how you decide to update the parameters.
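To make "updating the model's parameters" concrete, here is a deliberately tiny sketch, not a real LLM: a "pre-trained" one-parameter linear model whose weight is nudged by gradient descent on a small specialized dataset. The model and the numbers are invented; real fine-tuning applies the same idea across billions of parameters:

```python
# Toy illustration of fine-tuning: start from a "pre-trained" parameter
# and update it by gradient descent on a small specialized dataset.
# Model: y = w * x. The data below follows the relation y = 2 * x.

w = 1.0  # the "pre-trained" parameter

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # small fine-tuning dataset

def loss(w: float) -> float:
    # Mean squared error over the fine-tuning dataset.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

initial_loss = loss(w)
lr = 0.02
for _ in range(200):  # "fine-tuning epochs"
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # the parameter update at the heart of fine-tuning

final_loss = loss(w)
print(round(w, 3))  # → 2.0 (the parameter has specialized to the new data)
```

The point of the sketch: the pre-trained value is only a starting point, and the specialized dataset pulls the parameters toward the behavior we want.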
The structure of the SAP dataset will determine what the Model will learn
Models are representations of data. During the training process, models find a way of compressing and representing data to best capture the patterns in the data itself. The dataset’s structure determines what kinds of capabilities we explicitly want to feed the model with. Broadly speaking, fine-tuning entails at least three kinds of datasets:
| Dataset Type | What capabilities does it give the model? |
| --- | --- |
| 🔤 Token-Based Dataset | Think of this as an unstructured pile of text. When training on this kind of dataset, we are simply conditioning the model to produce text more like what it contains. |
| 👉 Instruction Dataset | Instruction datasets are composed of examples containing an “instruction,” an “input,” and an “output.” At inference time, this dataset type allows us to provide meta-information about the task that we want it to perform. |
| 💬 Human Feedback Dataset | This typically comes in the form of human preference comparisons of two responses: a winning response and a losing response. This data type is the most complex; the RLHF framework can use human feedback data to train a reward model, which can then be used to update the base language model via reinforcement learning. |
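For example, an instruction dataset is commonly stored as JSON Lines, one example per line, each with an instruction, an input, and an output. The SAP-flavored contents below are invented for illustration:

```python
# Sketch of an instruction dataset serialized as JSON Lines (JSONL),
# the format many fine-tuning jobs expect: one JSON object per line.
# The SAP-flavored example records are invented for illustration.
import json

examples = [
    {
        "instruction": "Summarize the purchase order for the approver.",
        "input": "PO 4500001234: 10 laptops, vendor ACME, total 12,000 EUR.",
        "output": "Purchase order 4500001234 requests 10 laptops from ACME for 12,000 EUR.",
    },
    {
        "instruction": "Classify the support ticket by SAP module.",
        "input": "User cannot release a blocked invoice for payment.",
        "output": "FI (Financial Accounting)",
    },
]

# One JSON object per line.
jsonl = "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
print(jsonl.splitlines()[0])
```

At inference time, the "instruction" field is what lets us tell the fine-tuned model which task to perform on a given input.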
In my example, I use the co:here LLM because they have a great offering on AWS to deploy their models privately, and they provide their co:here Multilingual Embedding Model on the marketplace, ready to be consumed. This multilingual embedding model is what we need to map text into a semantic vector space, positioning texts with similar meaning in close proximity. During a search, we use it to map the query into the vector space and locate relevant documents. This is semantic search based on NLP, not a keyword search engine, which is why the results are often better than a keyword search.
ℹ️ Pro Tip
The vector database will grow quite large if we feed it from multiple SAP attachments. Make sure the vector database can grow
I selected Pinecone as the vector DB for this demonstration because it has all the prerequisites we need: it scales as we grow, it is fully managed, and it is cheap. It integrates natively with the co:here Embed API endpoint to generate multilingual embeddings and then indexes those embeddings back into Pinecone for fast and scalable use cases.
The vector engine is powered by k-nearest neighbor (kNN) search, which has proven to deliver reliable and precise results. Make sure the selected vector DB supports integration with LangChain, Amazon Bedrock, or Amazon SageMaker, so we can easily connect our preferred ML and AI stack to the vector engine.
LangChain for building LLM Applications
The LangChain library offers a plethora of tooling and third-party integrations to build powerful applications (“chains”) driven by LLMs. It also provides pre-built chains for some of the most common LLM applications, like Retrieval Augmented Generation (RAG) or chat.
We can build and integrate many AWS applications with LangChain, and the concept of Bedrock is clearly influenced by it.
Essentially, LangChain is to LLM application frameworks what Hugging Face 🤗 is to models: it has dramatically simplified development.
LangChain integrates with the complete ecosystem of models, data, and databases, and in future blogs I will describe how to easily integrate LangChain with SAP through APIs.
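The "chain" idea itself is simple enough to sketch in plain Python, without importing LangChain: small steps (retrieve, build a prompt, call a model) composed into one pipeline. Everything below is a stand-in: the retriever returns a hard-coded snippet and `fake_llm` replaces a real LLM endpoint:

```python
# Plain-Python sketch of the "chain" concept that LangChain packages up:
# retrieval, prompt templating, and a model call composed into one
# pipeline. All components are stand-ins for illustration only.

def retrieve(question: str) -> str:
    # Stand-in retriever; a real chain would query a vector store.
    return "Invoices are posted in SAP transaction FB60."

def build_prompt(context: str, question: str) -> str:
    return f"Answer using only this context:\n{context}\nQuestion: {question}"

def fake_llm(prompt: str) -> str:
    # Stand-in model: simply echoes the context line it was given.
    return prompt.splitlines()[1]

def rag_chain(question: str) -> str:
    # The "chain": the output of each step feeds the next.
    return fake_llm(build_prompt(retrieve(question), question))

answer = rag_chain("Where do I post an invoice?")
print(answer)  # → Invoices are posted in SAP transaction FB60.
```

LangChain's value is exactly this composition, plus ready-made integrations for real retrievers, prompt templates, and model endpoints.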
In this blog, I introduced techniques for customizing foundation models and feeding them with our own data.
One of these customization techniques is RAG, which leverages embeddings to represent objects close to similar ones. This dramatically changes the way we address data: we no longer search for objects by key; we search by the context of what surrounds them. That is a new way to interact with our SAP data, and it makes categorizing SAP data faster and easier than expected.
In the following blogs, I will show how to build a simple application on LangChain that interacts with the co:here LLM, interacts with us, and is fed with SAP data.