Which Embedding Model should I use with my Corporate LLM?

AronMac · ‎02-27-2024

NOTE: The views and opinions expressed in this blog are my own

Let's get right to the point: I can't give you a one-size-fits-all answer. It depends.

There are many different considerations, such as Accuracy, Data Privacy, Performance, Languages, Integration, Support and Cost.

As an example, a comparison of options for running an Embeddings service on SAP AI Core might be:

Category	Open Source LLM	Word2Vec	OpenAI (ADA)
Data Privacy	High	High	Medium
Performance	Medium	Medium	High
Languages	Medium	Low	High
Integration	Medium	High	Medium
Support	Medium	Low	High
Cost	Medium	Low	Medium to High

What's not listed here is 'Accuracy', or more specifically, 'How well does the Embedding model perform for your Business use case?'

Why? Well, that requires clearly defining the use case, the validation scenario and the comparison analytics.

If you are interested in examples of how to run these services on SAP AI Core, you may wish to refer to my earlier blogs:

Share corporate info with an LLM using Embeddings

Lets add 2 Custom Embedding models to SAP AI CORE

If you follow those blogs you will have SAP AI Core (Extend) running with 4 models to test:

Open Source LLM	Ollama running Microsoft/Phi-2
	WhereIsAI/UAE-Large-V1
Word2Vec	Word2Vec (google-news-300)
OpenAI (ADA)	Azure OpenAI text-embedding-ada-002

For the remainder of the blog, let's assume we run a World Famous Chocolate Factory. We have confidential recipes and ingredient lists stored as unstructured files that we want to use in LLM calls (content prompts), but data privacy is our number one concern. We already trust SAP for our business operations, so we have decided to use SAP AI Core to convert the unstructured data into embeddings.

[Image generated with the help of DALL-E]

If you promise not to tell anyone, our secret ingredient lists are:

id	Item Name	Secret Ingredients
1	Bubbling Berry Blast	Effervescent elderberries, carbonated cloud cream, sparkle sugar, giggle gum extract
2	Glimmering Galaxy Drops	Moon-dusted cocoa beans, starlight sweetener, cosmic vanilla, astral almond pieces
3	Enchanted Forest Fudge	Whispering walnuts, mystical maple syrup, emerald cocoa essence, velvet vine extract
4	Cloud Nine Confections	Sky-high whipped cream, rainbow sprinkle sparkles, sunbeam sugar crystals, dreamy dough swirls
5	Golden Gabfest Ganache	Gabbing goldenberries, chatter chocolate, whispering whipped cream, secret spice shimmer
6	Spectral Spice Skewers	Phantom peppermint, ghostly ginger, luminous licorice, mirage mace
7	Cosmic Caramel Crunch	Star-swirled caramel, asteroid almonds, nebula nougat, twinkle twirl toffee
8	Whirlwind Whispers Wafers	Wind-whipped wafer, silence sugar, breeze-blown berries, whispering whirlwind essence
9	Echoing Eclair Euphoria	Soundwave strawberries, reverberating raspberry cream, silent sugar crust, echoing essence
10	Mystical Moon Munchies	Lunar lemon zest, eclipse espresso beans, starry sea salt, mystical mocha

We ultimately want to be able to ask our Business systems questions like:

"My Chocolate farm may have a bad harvest this year, what's the current stock levels of products impacted?"

We know Embeddings and Vector comparisons won't answer this directly, but they are the first major stepping stone to get there, so we shall focus on them first.

We followed the steps in the earlier blogs and successfully converted our secret lists into embeddings for each of the models.

We then convert the question into embeddings.

With our secret data and questions turned into embeddings (vectors) we can then perform comparisons against them and rank the best matches.

In SAP HANA Cloud, storing Vectors and performing comparisons on them will become an easy option in Q1 2024.

For some datasets, though, it may not always be essential to use a Vector DB. You can still perform vector comparisons in code. Here's an example using Python to compute cosine similarity.

import numpy as np

def cosine_similarity(vector1, vector2):
    # Cosine of the angle between the two vectors: 1.0 means same direction
    dot_product = np.dot(vector1, vector2)
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)
    return dot_product / (magnitude1 * magnitude2)

We will now use the comparison function of our choice (e.g. cosine similarity) to rank our list against the question, to see which embedding model produces the more accurate results.
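As a rough sketch of that ranking step: the snippet below scores every item vector against the question vector and sorts from best to worst match. The `rank_items` helper and the 3-dimensional toy vectors are my own illustrative assumptions; real embeddings from the models above have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(vector1, vector2):
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

def rank_items(question_vector, item_vectors):
    """Return (item_id, score) pairs ordered from best to worst match."""
    scores = {
        item_id: float(cosine_similarity(question_vector, vector))
        for item_id, vector in item_vectors.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Made-up 3-dimensional vectors, purely for illustration (not real embeddings)
toy_items = {
    "item_a": np.array([1.0, 0.0, 0.0]),
    "item_b": np.array([0.7, 0.7, 0.0]),
    "item_c": np.array([0.0, 0.0, 1.0]),
}
toy_question = np.array([0.9, 0.1, 0.0])

ranking = rank_items(toy_question, toy_items)
for position, (item_id, score) in enumerate(ranking, start=1):
    print(position, item_id, round(score, 3))
```

The position printed in the first column is the kind of "ranking number" used in the comparison that follows.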

For example, here is the list of products that need cocoa beans. The ranking number indicates where each item landed when the vector comparison ordered the list by closeness to the question (1 is the closest match).

			Open source LLM	Open source LLM	Word2Vec	Azure OpenAI
id	Item Name	Secret Ingredients	UAE	PHI	W2V	ADA
2	Glimmering Galaxy Drops	Moon-dusted cocoa beans, starlight sweetener, cosmic vanilla, astral almond pieces	3	4	3	2
3	Enchanted Forest Fudge	Whispering walnuts, mystical maple syrup, emerald cocoa essence, velvet vine extract	1	7	7	4
5	Golden Gabfest Ganache	Gabbing goldenberries, chatter chocolate, whispering whipped cream, secret spice shimmer	2	9	6	1
10	Mystical Moon Munchies	Lunar lemon zest, eclipse espresso beans, starry sea salt, mystical mocha	5	6	3	6
AVERAGE			2.75	6.5	4.75	3.25
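The AVERAGE row can be reproduced with a few lines of Python. This is just a small sketch: the rank values are copied from the table above, while the dictionary layout and variable names are my own choice.

```python
# Rank positions copied from the table above (lower is better),
# in the order of item ids 2, 3, 5, 10
ranks = {
    "UAE": [3, 1, 2, 5],
    "PHI": [4, 7, 9, 6],
    "W2V": [3, 7, 6, 3],
    "ADA": [2, 4, 1, 6],
}

# Mean rank per model
average_rank = {model: sum(values) / len(values) for model, values in ranks.items()}

# The model with the lowest average rank placed the relevant items highest
best_model = min(average_rank, key=average_rank.get)

for model, value in average_rank.items():
    print(model, value)
print("Lowest average rank:", best_model)
```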

So which embedding model gives us the best matches?

For matching the ingredients with the exact word "chocolate", OpenAI ADA performed best
Word2Vec did the best at handling "mocha"
On average, the open source embeddings model WhereIsAI/UAE-Large-V1 handled it the best
The open source embeddings model Microsoft/Phi-2 was the worst

So perhaps WhereIsAI/UAE-Large-V1 is the winner for this single test, but it was not perfect.

What would you do to test the models further?

Add more questions
Add more variety of questions
Perform preprocessing to extract key elements
Add postprocessing / validation of answers
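To make "add more questions" a bit more concrete, here is one possible way to score a model across several questions. It is only a sketch: the `mean_relevant_rank` helper, the test case layout and all the ids in it are hypothetical placeholders, not results from the models above.

```python
def mean_relevant_rank(ranked_ids, relevant_ids):
    """Average 1-based position of the relevant ids within a ranked list."""
    positions = [ranked_ids.index(item_id) + 1 for item_id in relevant_ids]
    return sum(positions) / len(positions)

# Hypothetical test cases: "relevant" holds the ids a human marked as correct,
# "ranked" holds the ids in the order one embedding model returned them.
test_cases = [
    {"question": "question 1", "relevant": [2, 3], "ranked": [3, 2, 1, 4]},
    {"question": "question 2", "relevant": [4], "ranked": [1, 4, 2, 3]},
]

per_question = [
    mean_relevant_rank(case["ranked"], case["relevant"]) for case in test_cases
]
overall = sum(per_question) / len(per_question)
print(per_question, overall)
```

Running the same test cases against each model gives one overall number per model, which is easier to compare than a single question.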

I welcome your comments and suggestions below.

SAP notes that posts about potential uses of generative AI and large language models are merely the individual poster’s ideas and opinions, and do not represent SAP’s official position or future development roadmap. SAP has no legal obligation or other commitment to pursue any course of business, or develop or release any functionality, mentioned in any post or related content on this website.
