Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
Shu_Zhen
Employee


This blog post is part of a series that dives into various aspects of SAP’s approach to generative AI, and its technical underpinnings. 

In the previous blog post of this series, we learned about benchmarking AI models. In this blog post, we discuss engineering best practices in the field of generative AI and illustrate the concepts with concrete examples.  

Read the first blog post of the series. 

Introduction  

At SAP, we're at the forefront of integrating AI into business applications and platforms. While generative AI holds immense potential to unlock unprecedented value, translating this innovative technology into practical business solutions does not come without its challenges. It demands rigorous adherence to requirements of reliability, transparency, and ethics to achieve business value. 

The development of generative AI applications stands at the intersection of AI technology and deep business domain knowledge. Engineering challenges in developing generative AI applications differ from traditional software development due to unique data requirements, the need for specialized evaluation metrics, ethical considerations, and iterative development processes inherent to generative AI. It is a meticulous process that evolves through hypothesizing, experimenting, validating results, and iterating with feedback — each step is a cornerstone in delivering business AI. On the other hand, understanding the business domain is crucial for enhancing efficiency, user experience, and overall value for business users.  

This blog post centers around the engineering best practices that underpin the development of business AI use cases at SAP. To explain the approach, we take the integration of generative AI within SAP SuccessFactors’ business applications, with AI-assisted compensation discussions functionality as a specific case in point. 

Engineering Best Practices for Generative AI  

At SAP, teams are developing a wide range of use cases, from question answering, text generation, classification, and summarization to code generation, and are exploring the potential of emerging paradigms such as agentic workflows. Because these use cases run mission-critical processes for customers, SAP teams follow a common set of engineering best practices to ensure enterprise quality. These are discussed below. 

1. Foundational pillars for best practices 

Figure 1 outlines the core pillars of best practices, the specifics of which will be explored in the next section's use case deep dive. For instance, discussions on data quality will involve counterfactual analysis, while performance testing will be examined through prompt and model benchmarking results. Moreover, issues of security and ethics will be addressed through AI ethics reviews. 

04-01.png

Figure 1: Overview of key pillars covered by best practices 

Let us understand each of these pillars. 

Data quality: At the core of relevant, responsible, and reliable business AI is the high fidelity of the data produced by SAP’s business applications. Only with high-quality input data can we expect relevant output. SAP's data governance and standards ensure that the data used is of high quality and semantically rich.  

Bias assessment: It is imperative to assess and mitigate any potential biases in the data or model predictions. This is to ensure that the application is also “responsible”, which is one of the three “R”s, as explained in this blog post by our CEO, Christian Klein.  

Performance and testing: Rigorous evaluation and testing guarantee that results are accurate and reliable. At the same time, applications should meet the performance criteria of their specific domains, and this should be baked in early in the development process to avoid the cost of retrofitting it later. Another key consideration is the choice of performance metrics, which depends on business objectives and data characteristics. 

Monitoring and maintenance: Continuous monitoring helps identify and address issues that arise during usage (such as performance degradation) and enables proactive, continuous improvement based on feedback as well as user research. To achieve the best business outcome, we leverage up-to-date technology and tools. 

Security and privacy: Security is not by chance, but by design. Following SAP’s strict security standards and data privacy policy, we implement strong authentication, authorization, threat modeling, and regular audits to safeguard business processes and data. 

Explainability: Our product is designed around the concept of explainability to foster enhanced trust and greater user engagement. Unlike a "black box" that produces results without explanation, our applications allow authorized users to view what contributes to these results, for example through relevant logs or by visualizing the raw data used in an inference call. 

Ethics: Our longstanding commitment to our AI ethics policy and guidelines ensures that use cases go through a rigorous ethics assessment, covering commitment to human rights, designing for people, striving for bias-free business, transparency, upholding quality, and safety. This aligns with our purpose of helping the world run better and improving people’s lives.   

2. Following a common generative AI architecture  

To enable internal teams to follow best practices efficiently, use cases are developed based on a common architecture described in the first blog post of this series. This architecture provides teams with tools and technology components on SAP BTP, including the generative AI hub in SAP AI Core, the SAP HANA Cloud vector engine, and SAP Joule. The generative AI hub in SAP AI Core provides trusted access to Large Language Models (LLMs), business grounding for LLMs, and LLM exploration, as explained in detail in the second blog post of this series.   

Use Case Deep-dive: SAP SuccessFactors AI-assisted Compensation 

In this section we will deep-dive into SAP SuccessFactors AI-assisted compensation functionality to illustrate how best-practices guide business AI use case development at SAP. This includes a thorough evaluation process that incorporates an AI ethics review. 

04-02.png

Figure 2: Focus on three pillars - data quality, bias assessment, and performance/testing 

We will prioritize three fundamental categories for our deep dive: Data quality, bias assessment, and performance/testing. Within these categories, we will especially focus on quality and fairness, with key topics illustrated in Figure 2. In the following sections, we will explore these perspectives in detail, including solution accuracy, robustness, cost-reduction mechanisms, and bias mitigation.  

1. Use case background 

Discussions about compensation between managers and their direct reports are sensitive and require careful consideration. Each compensation discussion can be different and requires collating many data points, such as job profile, compensation history, and organizational pay bands, to understand the employee’s compensation profile. Managers must spend time analyzing various data sources and preparing talking points for each employee. The SAP SuccessFactors AI-assisted compensation system facilitates an equitable and efficient approach for managers to get insights into an employee’s compensation. Managers are thus able to hold sensitive compensation conversations with easy access to data insights and talking points personalized for the employee. 

LLMs offer enormous potential to make this process simpler and more efficient. They provide powerful capabilities to automatically extract key points, summaries, and themes from lengthy historical data, unlocking valuable insights that enable business users, in this case managers, to make the right business decisions efficiently. 

However, as mentioned earlier, using LLMs to analyze compensation data presents its challenges. While LLMs are trained on vast amounts of data and can generate human-like text, they struggle with contextual understanding and thus may result in content that is factually incorrect or misleading. Specifically, lacking a comprehensive understanding of the compensation context makes it hard for them to capture pay bands specific to the company, country, or industry. They are not adept at complex mathematical calculations, leading to errors in reporting increments and thus unreliable quantitative metrics. Additionally, SAP SuccessFactors harnesses extensive tabular data. Unlike text and code, which are one-dimensional, tabular data are two-dimensional. LLM’s restricted ability to comprehend two-dimensional tabular data amplifies the overall complexity. Furthermore, achieving impartiality poses a significant challenge, as biases can be intrinsically embedded in model training or user input, presenting a major ethical concern. 

Following best practices, we implemented an LLM-based method for employee compensation review by combining tabular data pre-processing with advanced LLM prompt techniques. Table serialization is employed to reduce token count and make the data more understandable for the LLM. We extensively evaluate accuracy, robustness, and bias to ensure the delivered solution is relevant and responsible. See Figure 3 for an example of AI-assisted compensation insights, and see the one-minute demo in this video. 

342478_2-1715693119967.png

Figure 3: Illustration of AI-assisted compensation insights 

2. Boosting relevance with advanced prompting and model selection 

Accuracy of the AI predictions is key in providing value to our business users. Across use cases, teams are applying an array of prompt engineering and fine-tuning best practices to infuse domain knowledge and customer specific information. The approach is tailored to ensure that each use case meets business requirements and achieves the desired solution quality. For example, as an advanced prompt engineering technique, chain of thought (CoT) prompting facilitates the generation of more controlled, accurate, and relevant outputs through a step-by-step guidance process. Another example is an agentic strategy like ReAct, which empowers LLMs to leverage tools to improve accuracy.   
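To make the chain-of-thought idea concrete, the sketch below assembles a hypothetical CoT prompt for a compensation analysis task. The wording, field names, and function are illustrative assumptions, not SAP's production prompt.

```python
# Hypothetical sketch of a chain-of-thought (CoT) prompt for compensation
# analysis: the instructions walk the model through explicit steps and the
# grounding data is appended below them.
def build_cot_prompt(employee_name, salary_history):
    """Assemble step-by-step instructions together with grounding data."""
    history_lines = "\n".join(
        f"- In {year}, the salary was {salary} USD."
        for year, salary in salary_history
    )
    return (
        "You are an HR compensation assistant. Think step by step:\n"
        "1. List the salary for each year from the data below.\n"
        "2. Compute each year-over-year increase as a percentage.\n"
        "3. Summarize the overall compensation trend in one sentence.\n\n"
        f"Salary history for {employee_name}:\n{history_lines}"
    )

prompt = build_cot_prompt("Alex", [(2021, 60000), (2022, 63000), (2023, 66150)])
```

Keeping the instruction steps explicit and close to the grounding data is what aligns the prompt with the contextual data, as discussed above.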

We want to illustrate how best practices for advanced prompting techniques with long and short instruction contexts boost the accuracy of the SAP SuccessFactors AI-assisted compensation functionality. 

To create compensation insights, the context that provides instruction and grounding data about the employee is very important. For the compensation analysis processing to perform accurate arithmetic operations, the prompt instructions must be specific and well aligned with the pertinent contextual data. 

Figure 4 compares the quantity accuracy of CoT prompts with long and short instruction contexts and a combination of CoT and ReAct prompts. Because the LLM's response format varies with the prompt, we checked the quantity accuracy manually. The results show that all three prompting techniques achieved high accuracy scores for quantity extraction. The short context CoT prompt achieved the best arithmetic calculation accuracy. This demonstrates that employing advanced prompting techniques effectively yields high accuracies for both number extraction and arithmetic calculation.  

342478_3-1715693119968.png

Figure 4: Boosting performance by leveraging domain knowledge through advanced prompting techniques 

Once the prompt version has been optimized, we choose the most suitable model from a variety of SAP-built and partner models. For example, for our use case we carry out a range of experiments to benchmark models. Continuing with the short context CoT prompt, we conduct additional testing on LLM’s capability to categorize employees by certain categories such as “underpaid”, “overpaid” and “fair paid”, based on their compensation history data (Figure 5). Among five LLMs shown below, GPT-4 significantly outperforms the other models, achieving higher accuracy across all three categories.  
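The categorization test above can be sketched as a short-context CoT prompt that constrains the model to exactly one label. The pay-band parameters, label set, and wording are hypothetical illustrations, not the benchmark's actual prompt.

```python
# Hypothetical sketch of the pay-classification probe: serialized
# compensation history plus a pay-band range is sent with a short CoT
# instruction, and the model must answer with exactly one label.
LABELS = ("underpaid", "fair paid", "overpaid")

def build_classification_prompt(serialized_history, band_min, band_max):
    return (
        "Think step by step. Compare the latest salary below with the pay "
        f"band {band_min}-{band_max} USD, then answer with exactly one of: "
        f"{', '.join(LABELS)}.\n\n"
        f"{serialized_history}"
    )

p = build_classification_prompt(
    "The Year is 2023. The Salary is 52000.", 55000, 75000
)
```

Restricting the answer to a fixed label set makes the accuracy per category (as in Figure 5) straightforward to score.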

Combining the techniques above yields an LLM-assisted process that is well specialized and accurate in examining employee compensation data. This approach may be used in other scenarios too. 

342478_4-1715693119968.png

Figure 5: Comparison of the ability to classify “underpaid”, “overpaid” and “fair paid” from the employee compensation data 

3. Scenario-specific testing for reliable and robust results 

The data in customer systems can be diverse, influenced by configuration and employee tenure. It is crucial that the generated insights are pertinent to the given context. This necessitates a regimen of scenario-specific testing with good coverage to ensure the prompt is robust and adaptable. 

In the context of the AI-assisted compensation, we follow this best practice and evaluate the LLM’s responses across a range of scenario-specific data inputs, focusing on a variety of scenarios, such as low or high variance in the compensation trend, new and long-standing employees, and a decreasing compensation ratio as an extreme case. The responses are then subject to human evaluation. A sample of the key findings is shown in Table 1. It can be seen that the prompt performs well across the tested scenarios, achieving high scores overall, although it scores slightly lower in the scenario involving a very long compensation history.  
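Such scenario-specific testing can be organized as a small harness that runs the same insight pipeline over one synthetic input per scenario. The scenario names mirror the cases listed above; the numbers and the stub pipeline are illustrative assumptions.

```python
# Hypothetical sketch of scenario-specific test inputs: one synthetic salary
# series per scenario from the text (low/high variance, new vs. long-tenured
# employee, decreasing compensation as an extreme case). All figures are
# made up for illustration.
scenarios = {
    "low_variance":     [60000, 60600, 61200, 61800],
    "high_variance":    [60000, 72000, 61000, 80000],
    "new_employee":     [65000],
    "long_history":     [40000 + 1500 * i for i in range(20)],
    "decreasing_trend": [70000, 68000, 66000, 64000],
}

def run_scenarios(scenarios, generate_insight):
    """Invoke the same insight pipeline once per scenario for later human review."""
    return {name: generate_insight(series) for name, series in scenarios.items()}

# Deterministic stub standing in for the real LLM pipeline:
outputs = run_scenarios(
    scenarios, lambda s: f"{len(s)} data points, latest {s[-1]}"
)
```

The harness keeps every scenario on the same code path, so human evaluators compare like with like when scoring the responses.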

342478_5-1715693119970.png

Table 1: LLM result quality from various input data scenarios 

4. Optimizing and serializing input data for increased relevance and speed 

Further to scenario-specific testing, infusing business data into AI use cases is a key component for relevant results. For enterprise use cases, this data often comes in tabular form with additional semantic context, as provided by SAP’s business applications. This needs to be optimized too.    

We begin by ensuring that the context data provided to generate insights is comprehensive enough to represent the employee’s tenure and compensation progression. Optimizing the token footprint of our input is therefore a necessary step to optimize speed. Simultaneously, another goal is to enhance accuracy by structuring the tabular data input in a manner that maximizes the level of comprehension by LLMs. 

Table 2 shows a mock data set, representing the salary progression of an employee. Like many other applications designed specifically for SAP SuccessFactors, the employee information is stored in a tabular format (structured data). However, LLMs are naturally not good at understanding tabular data when it is incorporated into their prompts. This is because the connection between column headers and corresponding cell values breaks down when the data is read sequentially. For instance, if we pass “2021” from the second row of Table 2 into the LLM, it will not recognize that the value “2021” pertains to the column “Year”. 

342478_6-1715693119972.png

Table 2. An example of the mock tabular data, used for employee compensation review 

To enhance the comprehensibility of tabular data for LLMs, we use a text template to serialize it before passing it to the prompt. The serialization adheres to predefined templates. For example, a simple template can follow the format “The {column_name} is {cell_value}.”. Our tests demonstrate that the proposed serialization method successfully enables the LLM to understand the relations within the tabular data while maintaining a reduced token size (approximately 25% fewer tokens compared to the original JSON format), leading to faster and more accurate results. 
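The template-based serialization above can be sketched in a few lines. The column names and values are illustrative mock data, not the actual SAP SuccessFactors schema.

```python
# Minimal sketch of table serialization with the template
# "The {column_name} is {cell_value}.": each row becomes one sentence-per-cell
# line, so every value stays tied to its column header when read sequentially.
def serialize_row(columns, row):
    return " ".join(f"The {c} is {v}." for c, v in zip(columns, row))

def serialize_table(columns, rows):
    return "\n".join(serialize_row(columns, row) for row in rows)

columns = ["Year", "Salary", "Currency"]
rows = [[2021, 58000, "USD"], [2022, 61000, "USD"]]
text = serialize_table(columns, rows)
```

Compared with embedding raw JSON (repeated keys, braces, and quotes per row), this flat text form both preserves the header-value relation and trims the token footprint.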

5. Understanding and mitigating inherent bias with prompt-centric debiasing 

LLMs are prone to producing biased completions due to the imbalanced dataset in the training process. Preventing the propagation of bias in generative AI applications presents unique challenges due to their content-generating nature, such as the compensation tool generating talking points. However, we have best practices in place to identify and mitigate bias by applying techniques such as specific prompt engineering or fine-tuning. 

To illustrate this in the context of our use case, we examine potential unintended biases by simulating the recommended salary adjustments across different gender groups. This is achieved by alternating between male and female names in the input parameters to represent the respective gender groups, which are then used to invoke LLM API calls. Note that the salary adjustment is not included in the end results presented to users; it is used exclusively for calculating the quantity for internal assessment purposes. 

The outcomes of these calls are subsequently analyzed for statistical assessment. 
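The counterfactual name-swapping probe can be sketched as follows. The function names, the name lists, and the deterministic stub standing in for the LLM API are all hypothetical illustrations.

```python
# Hypothetical sketch of the counterfactual bias probe: the same employee
# record is sent repeatedly, differing only in the name used, and the numeric
# outputs are collected per group for statistical comparison. call_llm is a
# stand-in for the real model API call.
def probe_gender_bias(record, male_names, female_names, call_llm):
    results = {"male": [], "female": []}
    for group, names in (("male", male_names), ("female", female_names)):
        for name in names:
            variant = dict(record, name=name)  # identical data, only the name swapped
            results[group].append(call_llm(variant))
    return results

# Deterministic toy stub in place of a real LLM call:
stub = lambda rec: 3.0 + 0.1 * (len(rec["name"]) % 2)
out = probe_gender_bias(
    {"salary": 60000}, ["James", "Liam"], ["Emma", "Olivia"], stub
)
```

Because every variant shares the same underlying record, any systematic difference between the two result groups can be attributed to the name alone.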

342478_7-1715693119973.png

Figure 6: Comparison for scenarios distinguished by gender in input and prompting style 

Figure 6 illustrates the evaluation flow for different scenarios based on gender in the input data and various prompting styles. Specifically, we adopt and assess an unbiased prompting style, adding explicit instructions that direct the model to remain unbiased. 

In Figure 7, we compare the distributions of results obtained before and after the use of an unbiased style of prompting and gender pronoun removal. We find that proper prompting reduces the bias: male and female cases receive similar result distributions with tighter bounds. When gender was directly included in the prompt, those identified as non-binary were generally favored by the LLM. 
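The two debiasing steps just described, removing gender pronouns from the context data and appending an explicit impartiality instruction, can be sketched as a small preprocessing step. The pronoun map and instruction wording are illustrative assumptions.

```python
# Hypothetical sketch of prompt-centric debiasing: gendered pronouns in the
# context data are neutralized, and an explicit impartiality instruction is
# appended to the prompt.
import re

PRONOUN_MAP = {
    r"\bhe\b": "they", r"\bshe\b": "they",
    r"\bhis\b": "their", r"\bher\b": "their",
}

def remove_gender_pronouns(text):
    for pattern, neutral in PRONOUN_MAP.items():
        text = re.sub(pattern, neutral, text, flags=re.IGNORECASE)
    return text

UNBIASED_SUFFIX = (
    "\nBase your analysis only on the compensation data. Do not let name, "
    "gender, or other personal attributes influence the result."
)

def debias_prompt(context):
    return remove_gender_pronouns(context) + UNBIASED_SUFFIX

cleaned = debias_prompt("She joined in 2020 and her salary rose steadily.")
```

A production implementation would need a more careful rewrite (pronoun case, possessive vs. object forms, names), but the sketch shows the shape of the two measures evaluated in Figure 7.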

04-07-1.png

04-07-2.png

Figure 7: Gender-based group result distributions: before and after (vertical bars represent ranges and dots represent outliers)

The effect of bias reduction is illustrated in Figure 8. We deduce the bias from the discrepancies observed between the two gender groups. More specifically, we calculate the discrepancy between distributions of gender-based results by measuring the difference in their mean values on a standardized scale. In this figure, for instance, employing a basic prompt will lead to a difference of 0.162 between gender groups with GPT-4. However, this difference can be minimized to a much smaller one of 0.049 using the final prompt.  
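The discrepancy metric described above, the difference of group means on a standardized scale, can be sketched as below. Normalizing by the pooled standard deviation is one common choice and an assumption here; the exact normalization used in the study is not specified, and the sample values are made up.

```python
# Minimal sketch of the discrepancy metric: absolute difference of the two
# group means, divided by the pooled (population) standard deviation of all
# observations to put it on a standardized scale.
from statistics import mean, pstdev

def standardized_mean_diff(group_a, group_b):
    pooled = pstdev(group_a + group_b)
    return abs(mean(group_a) - mean(group_b)) / pooled

male = [3.0, 3.2, 3.1, 3.3]     # toy per-group LLM outputs
female = [3.1, 3.3, 3.2, 3.4]
d = standardized_mean_diff(male, female)
```

A value near 0 indicates the two gender groups receive essentially the same distribution of results; larger values quantify the residual bias, mirroring the 0.162 vs. 0.049 comparison in Figure 8.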

342478_10-1715693119974.png

Figure 8: Effect of bias reduction based on discrepancy of gender group mean values  

Based on this evaluation of bias mitigation measures, we actively adjusted our approach to reduce bias in the production implementation. This includes removing gender information from the prompt context data and adopting an unbiased prompting style. 

6. Ensuring reliable and responsible AI through the ethics review process 

In generative AI, ethical considerations are crucial due to its potential usage for completions, summarization, and classification use cases. Ensuring fairness, transparency, and accountability in the generation process is essential to mitigate biases, uphold privacy rights, and promote responsible business AI applications. At SAP, our AI ethics policy mandates that generative AI applications conform to the three fundamental pillars – “Human Agency & Oversight”, “Addressing Bias & Discrimination” and lastly “Transparency & Explainability”.  

 

342478_0-1715697802561.png

Figure 9: Pillars of SAP’s AI ethics policy 

SAP SuccessFactors has an AI acknowledgment framework that business users must acknowledge before using or viewing any AI capability. Additionally, the product standards and risk assessment process mandate that all personally identifiable data is handled appropriately and anonymized. Data access controls are applied via existing role-based permissions, and the insight data available is restricted to what the user has access to. Based on application type and the data processed, AI use cases are internally categorized by risk level and undergo a review by the SAP AI ethics steering committee to address any potential concerns related to bias and data safety. The AI-assisted compensation discussion use case has had thorough reviews with the SAP ethics steering committee. The review process has resulted in a strengthened warning notice and improvements of both the prompt and insights content to reduce bias.  

The improvements involve steering clear of presumed performance, avoiding generic statements, and creating actionable, bulleted points. Following our best practices, users can view all the data sources that are used to generate insights, ensuring explainability and transparency. This process is described in detail as part of SAP AI ethics handbook. 

At the same time, we remain compliant with the European Union AI Act, which lists manipulation of cognitive behavior, social scoring, and biometric identification of people and their categorization based on it as “unacceptable risks”. Furthermore, there are well-defined transparency requirements for general-purpose AI systems to be considered. Our solutions, including the AI-assisted compensation review, pass these requirements, thanks to the above-mentioned stringent AI ethics review. 

Conclusion 

At SAP, our engineering best practices are tailored to develop ethical, resilient, and scalable generative AI applications. This synergy enables us to unlock the full potential of generative AI within our business applications and platform to enhance efficiency and productivity for our customers. 

In this blog post, we illustrated this by exploring the critical intersection of SAP SuccessFactors business domain knowledge and generative AI expertise. We shared examples of engineering best practices applied during the development of AI-assisted compensation discussion scenario. We detailed the best practices by presenting their quantitative effectiveness from various perspectives, including relevance, robustness, reliability, and bias. Additionally, we outlined the AI ethics review that this use case has been subjected to. 

Co-authored by Dr. Shu Zhen, Gayatri Gopalakrishnan, Dr. Jan Dumke, and Akhil Agarwal