
First things first


Regular ChatGPT is backed by a large language model with a lot of general knowledge from its training data. I believe that if you are reading this blog, you have already tried OpenAI's ChatGPT and were impressed by it. However, things might get a little weird when we ask for insights about a specific context it was not trained on. For example, there is the classic (and yummy) cow eggs answer from a couple of months ago in early versions of GPT:



This is a phenomenon called model hallucination. To address this issue, we use prompt engineering to guide the model towards a more accurate answer. This is paramount when we ask business-specific questions on which GPT was not trained. But how can we embed business context when calling OpenAI's completion API? That's what we are about to find out.

Prerequisites


This blog is part of the blog series "Connecting the dots - Using SAP Business Technology Platform to make GPT a better product". If you have not checked it yet, I would strongly advise you to take a look at it before moving forward.

  • SAP Data Intelligence tenant

  • AWS S3 bucket or other file storage system

  • HANA Cloud tenant or other database to be the source of the business context

  • OpenAI API key

  • Advanced Python knowledge

  • SAP Data Intelligence knowledge


Data context


The main objective of this blog series is to insert data context into prompts for GPT. As a very simple dataset to train on and extract data from, I will be using the SFLIGHT database and the SCITAIRP table. To reduce the complexity (and the costs involved in OpenAI's API usage), this will be a simple example to check which airports are available in our dataset. You can either download this sample database from SAP's official repository by clicking on this link and import it into a trial HANA Cloud instance following this tutorial, or you can go rogue and use your own data to follow the steps described in this blog.

SAP Data Intelligence pipeline overview


To generate the necessary embeddings for the prompt context, we will be using SAP Data Intelligence Pipelines. In a nutshell, this pipeline will use some standard and very well-known features of SAP Data Intelligence.



First, we will read data from a HANA database and write it to a CSV file in an S3 bucket. Pretty straightforward for those who already use SAP Data Intelligence. There are some nice examples that SAP provides if you want to explore the steps required to do this in more detail. The important part here is generating the CSV file so that the Python operator can consume it. And this is where things begin to get complicated. To use the Python libraries, we will need to create a Dockerfile and a group, so that the Python operator is able to access OpenAI's libraries. The documentation for the usage of Python libraries and this neat blog written by Ian Henry might help. To be able to run the Python script from this blog (which we will take a look at in a bit), you will need to add the lines below to the Dockerfile:



FROM $com.sap.sles.base

RUN python3.9 -m pip install openai
RUN python3.9 -m pip install pandas
RUN python3.9 -m pip install boto3
RUN python3.9 -m pip install tiktoken
RUN python3.9 -m pip install numpy

Also add the necessary tags to the Python operator group and the Dockerfile:
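For illustration only, and assuming you simply name the tags after the installed libraries (the exact names are up to you, as long as the Docker image and the Python operator group carry matching tags), the tag list could look like this:

openai
pandas
boto3
tiktoken
numpy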



Now comes the magic part: we will use Python to call OpenAI's API to generate embedding data based on the file that is stored in the S3 bucket. The embedding API is used to measure the relatedness of text strings.

In this blog, we will not be doing a deep dive into the logic behind this script (I am not an AI expert, and this is a topic already covered by many dedicated articles). But in a nutshell, this script will generate a vector of floating-point numbers for each text string; comparing these vectors is how we measure how related the strings are. The magic of using specific context in prompt engineering (which we will see in the next blog) is that we will use cosine similarity to analyze which values in the stored embeddings are related to the prompt embedding.
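Just to make the cosine similarity idea a bit more concrete, here is a minimal sketch (using numpy, which we already install in the Dockerfile). The function names are illustrative only and are not part of the pipeline script below; the prompt-side logic itself is covered in the next blog.

import numpy as np

# Illustrative only: cosine similarity between two embedding vectors
def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    # Dot product of the vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative only: rank stored embeddings by how closely they match a prompt embedding
def rank_by_similarity(prompt_embedding: list[float], doc_embeddings: dict) -> list:
    return sorted(
        doc_embeddings.items(),
        key=lambda item: cosine_similarity(prompt_embedding, item[1]),
        reverse=True,
    )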

A great reference to better understand the concepts I just shared is this Jupyter notebook from OpenAI's GitHub. For now, what you need to understand is that this script generates a CSV string, which the next step writes to a file in an S3 bucket, to be retrieved later on when a user asks a question in our scenario. This way, we do not have to generate embeddings for the entire dataset on every single request that a user makes.
import os
import openai
import pandas as pd
import boto3
import tiktoken
import json
from io import StringIO
import numpy as np

os.environ["OPENAI_API_KEY"] = '<YOUR_OPENAI_API_KEY>'

openai.api_key = os.getenv("OPENAI_API_KEY")

# Open a session to the S3 bucket where the source CSV is stored
session = boto3.Session(
    aws_access_key_id='<YOUR_S3_ACCESS_KEY>',
    aws_secret_access_key='<YOUR_S3_SECRET_ACCESS_KEY>',
)

bucket_session = session.client('s3')


def get_csv_document(bucket: str, key: str):
    # Read the CSV file generated by the previous pipeline step into a dataframe
    return pd.read_csv(bucket_session.get_object(Bucket=bucket, Key=key).get("Body"))


def get_embedding(text: str, model: str = "text-embedding-ada-002"):
    # Call OpenAI's Embeddings API for a single text string
    result = openai.Embedding.create(
        model=model,
        input=text
    )

    return result["data"][0]["embedding"]


def compute_doc_embeddings(df: pd.DataFrame, index: list) -> dict:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.

    Return a dictionary that maps the index of each row to its embedding vector.
    """
    df = df.set_index(index)
    return {
        idx: get_embedding(r["CITY"]) for idx, r in df.iterrows()
    }


def construct_index(input2):
    # Read the airport data extracted from HANA and stored in S3 by the previous step
    airports = get_csv_document('<YOUR_S3_BUCKET>', '<YOUR_S3_OBJECT_WITH_DATA_NAME>')

    # One embedding per row, keyed by the AIRPORT column
    embedding = compute_doc_embeddings(airports, ["AIRPORT"])

    # Serialize the embeddings to a CSV string and send it to the operator's output port
    csv_buffer = StringIO()
    pd.DataFrame(embedding).to_csv(csv_buffer)
    body = csv_buffer.getvalue()

    api.send("indexStr", body)


api.set_port_callback(["input2"], construct_index)



After adding the step to write the CSV to the S3 bucket using the output from the Python operator, you can run the pipeline. If everything is correct, you should see in S3 that a CSV file with a bunch of floating-point numbers has been generated.
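Just to illustrate what will be done with that file later on (the details are in the next blog), a minimal sketch for reading the stored embeddings back from S3 into numpy vectors could look like the snippet below. The bucket and object names are placeholders, just like in the pipeline script above.

# Illustrative only: load the stored embeddings CSV from S3 and rebuild one numpy vector per airport
import boto3
import numpy as np
import pandas as pd

session = boto3.Session(
    aws_access_key_id='<YOUR_S3_ACCESS_KEY>',
    aws_secret_access_key='<YOUR_S3_SECRET_ACCESS_KEY>',
)
bucket_session = session.client('s3')

def load_embeddings(bucket: str, key: str) -> dict:
    # The first CSV column is the vector position; the remaining columns are the airport codes
    df = pd.read_csv(bucket_session.get_object(Bucket=bucket, Key=key).get("Body"), index_col=0)
    return {airport: np.array(df[airport]) for airport in df.columns}

stored_embeddings = load_embeddings('<YOUR_S3_BUCKET>', '<YOUR_S3_OBJECT_WITH_EMBEDDINGS_NAME>')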

That's great! Now that we have generated and stored the embeddings for the dataset, we can move forward to using these embeddings in a prompt for the completion API, a.k.a. asking GPT a question, a.k.a. the next blog: Connecting the dots - Using SAP Data Intelligence and python to embed custom data in OpenAI prompts